Welcome to Reproducible Medical Research with R (RMRWR). I hope that this book meets your needs.
This is a book for anyone in the medical field interested in analyzing the data available to them to better understand health, disease, or delivery of care. This could include nurses, dieticians, psychologists, and PhDs in related fields, as well as medical students, residents, fellows, or doctors in practice.
I expect that most learners will be using this book in their spare time at night and on weekends, as the medical school curriculum is already packed full of information, and there is no room to add skills in reproducible research to the standard curriculum. This book is designed for self-teaching, and many hints and solutions will be provided to avoid roadblocks and frustration.
Many learners find themselves wanting to develop reproducible research skills after they have finished their training, and after they have become comfortable with their clinical role. This is the time when they identify and want to address problems faced by patients in their practice with the data they have before them. This book is for you.
Thank you for giving this e-book a try. This is designed for physicians or others analyzing health data who are interested in pursuing this field using the R computer language.
We will assume that:
How to download and install R and RStudio will be addressed, step by step, in Chapter 2.
This book is structured on the concept of a “spiral of success”, with readers learning about topics like data visualization, data wrangling, data modeling, reproducible research, and communication of results in repeated passes. These will initially be at a superficial level, and at each pass of the spiral, will provide increasing depth and complexity. This means that the chapters on data wrangling will not all be together, nor the chapters on data visualization. Our goal is to build skills gradually, and return to (and remind students of) their previously built skills in one area and to add to them. The eventual goal is for learners to be able to produce, document, and communicate reproducible research to their community.
Most medical people who learn R to do their own data analysis do it on their own time. They rarely have time for a semester-long course, and their clinical schedules usually will not allow it. Fortunately, a lot of people learn R on their own, and there is a strong and supportive R Community to help new learners. A 2019 Twitter survey conducted by @RLadies found that more than half of respondents were largely self-taught, from books and online resources.
There are a lot of good resources for learning R, so why one more? In part, because the needs of a medical audience are often different. There are distinct needs for protecting health information, generating a descriptive Table One, using secure data tools like REDCap, and creating standard medical journal and meeting output in Word, PowerPoint, and poster formats.
More and more, all science is becoming data science. We are able to track patients, their test results, and even the individual pixels (voxels) of their CT scans electronically, and use those data points to develop new knowledge. While one could argue that health care workers should collect data and bring it to trained statisticians, this does not work nearly as well as you might expect. Most academic statisticians are incentivized to develop new statistical methods, and are neither very interested in, nor incentivized to do, the hand-holding required to wrangle messy clinical data into a manuscript.
There also are simply not enough statisticians to meet the needs of medical science. Having clinicians on the front lines with some data science training makes a big difference, whether in 1854 in London (John Snow) or in 2014 in Flint, Michigan (Mona Hanna-Attisha). Having more clinicians with some training will impact medical care, as they will identify local problems that would otherwise never have reached a statistician, and probably never have been addressed with data.
Beginning as far back as 1989, with the David Baltimore case, and increasingly and publicly through the 2010s, there has been a rising tide of realization that a lot of taxpayer-funded science is done sloppily, and that our standards as scientists need to be higher. The line between carelessly-done science and outright fraud is a thin one, and the case can be made that doing science in a sloppy fashion defrauds the funders, as it leads to results that cannot be reproduced by the authors or replicated by others. Particularly in medicine, where incorrect findings can cause great harm, we should take special care to do scientific research which is well-documented, reproducible, and replicable. This topic as a motivating force for doing careful medical research will be expanded upon in Chapter 1.
There are several icons at the top left, to the right of the clickable RMRWR link, that can be helpful:
1. The Table of Contents Sidebar - Click on the ‘hamburger’ menu icon (three horizontal lines) or press the s key to toggle the sidebar (table of contents) on and off. Within the sidebar, you can click on whichever chapter or subsection you want.
2. This book is Searchable - Click on the magnifying glass or use the f key to toggle the Find box and search for whatever you need to find.
3. You can change the font size, font, and background by clicking on the A icon.
4. You can download the chapter with the download icon (downward arrow into a file tray) in PDF or EPUB formats.
This is not an introduction to statistics. I am assuming that you have learned some statistics somewhere in secondary school, undergraduate studies, graduate school, or even medical school. There are lots of statisticians with Ph.D.s who can certainly teach statistics much more effectively than I can. While I have a master’s degree in Clinical Research Design and Statistical Analysis (isn’t that a mouthful!) from the University of Michigan, I will leave formal teaching of statistics to the pros.
If you need to brush up on your statistics, no worries. There are several excellent (and free!) e-books on that very topic, using R. Some good examples include (go ahead and click through the blue links to explore):
We will cover much of the same material as these books, but with a less theoretical and more applied approach. I will focus on specific medical examples, and emphasize issues (like Protected Health Information) that are particularly important for medical data. I am assuming that you are here because you want to analyze your own data in your (probably) very limited free time.
This book is also far from comprehensive in teaching what is available in the R ecosystem. This book should be considered a launch pad. Many of the later chapters will give you a taste of what is available in certain areas, and guide you to resources (and links) that you can explore to learn more and do more beyond the scope of this book. The R computer language has expanded far beyond statistics, and allows you to do many powerful things to improve your workflow, make amazing graphics, and share results with others.
Keep an eye out for helpful Guideposts, which look like this:
Warnings
This is a common syntax error, especially for beginners. Watch out for this.
Tips
This is a helpful tip for debugging.
Try It Out
Take what you have learned and try it yourself in the code box below.
Challenge - take the next step and try a more challenging example.
Try this more complicated example.
Explore More - resources for learning more about a particular topic.
If you want to learn more about Shiny apps, go to https://mastering-shiny.org to see an entire book on the topic.
Throughout this book you will find code examples and demonstrations, and interactive exercises in which you can practice writing R code right in the book. Let’s explain how to use these demonstration flipbooks and learnr exercises.
Flipbooks are windows in this book in which you can watch R code being built into pipelines, and see the results at each step. Each flipbook demonstrates some important code concepts, and often new functions in R. You can click on the window to activate it, then use the left and right arrow keys to go forward and back in the code, one step at a time. You will want to go through these slowly, and make sure that you understand what is happening in each step. You may even want to take notes, particularly on the function syntax, as you will likely encounter coding exercises with these functions shortly after the flipbook demonstration.
Take a look at the example of a flipbook below.
Activate it by clicking on it, and step through the pipeline of code with the right and left arrow keys. Watch the results of each step.
Learnr coding exercises are windows in this book in which you can write your own R code to solve a problem. Each learnr exercise tests whether you have mastered important code concepts, and often new functions in R. If needed, you can reset to a fresh code window with the Start Over button. You can type lines of code into the window, then click on the Run Code button at the top right to run the code and get your results. Your code may not produce the right result the first time, and you will have to interpret the error message to figure out how to fix it. Rely on the text and your notes and the demonstrations to help you. If you are stuck, you can click on the Hint button to see an example of correct code, and compare it to your own. If you would like, you can even copy this code to the clipboard with the Copy button.
Take a look at the example of a learnr exercise below.
There is a dataset piped into a series of functions (‘verbs’), with a blank. Fill in the blank with ‘p_vol’ (without the quotes), which stands for the variable prostate volume. Then run your code with the Run Code button to get a result. Practice using the Start Over button, the Hint button (there may be more than one - usually the last one is the solution), and the Copy To Clipboard button.
When you get a table of data as a result from a code pipeline, it may have more columns (variables) than can be displayed easily. When this is the case, there will be a black arrow pointing rightward at the top right of the table of results. Click on this to scroll right and see more columns.
A table of data as a result from a code pipeline may also have more rows (observations) than can be displayed easily. When this is the case, the table will be paginated, with 10 rows per page. At the bottom right of the table, there will be a clickable listing of pages, along with Previous and Next buttons. Click on these buttons (or the page number buttons) to see more pages of data to inspect your results.
An important note on coding: you should always have an internet search window open when you are writing code. No one can remember every function, nor the correct arguments and syntax of each function. A critical skill in writing code is searching for how to do something correctly. This is not a sign of weakness. Professional programmers google “how do I do x?” hundreds of times a day. This is how programming is done. You will often search for things like “how do I do x in R?” or “how to x in tidyverse”. This is completely normal, and to be expected. You do not have time to memorize hundreds of functions, and you may have days or even weeks between coding sessions (because of your day job), making it hard to remember all the details from your last coding session. This is not a problem. There are lots of websites that can help you solve specific problems, as you will find in the How to Find Help chapter.
One of the most intimidating parts of getting started with something new is the actual getting started part. Don’t worry, I will walk you through this step by step.
While in many chapters, I will list the R packages you need, in this chapter, you will be downloading and installing new software, so I will list the links here for your reference.
This Chapter is part of the TOOLS pathway. Chapters in this pathway include
R is a statistical programming language, designed for non-programmers (statisticians). It is optimized to work with data in rectangular tables of rows (observations) and columns (variables). It is a very fast and powerful programming engine, but it is not terribly comfortable or convenient. R itself is not terribly user-friendly. It is a lot like a drag racing car, which is basically a person with a steering wheel strapped to an airplane engine.
Very aerodynamic and fast, but not comfortable for the long run (more than about 8 seconds). You will need something more like a production car, with a nice interior and a dashboard, and comfy leather seats.
This equivalent of a comfy coding environment is provided by the RStudio IDE (Integrated Development Environment). I want you to install both R and RStudio, in that order.
Let’s start with installing R.
R is free and available for download on the web. Go to the r-project
website to get started.
This screen will look like this
You can see from the blue link (download R) that you can use this
link to download R, but you will be downloading it faster if you pick a
local CRAN mirror.
You might be wondering what CRAN and CRAN Mirrors are. Nothing to do
with cranberries, fortunately. CRAN is the Comprehensive R Archive
Network. Each site (mirror) in the network contains an archive of all R
versions and packages, and the sites are scattered over the globe. A
CRAN Mirror maintains an up to date copy of all of the R versions and
packages on CRAN. If you use the nearest CRAN mirror, you will generally
get faster downloads.
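Once R is installed, you can also choose a mirror from inside R itself. The lines below are a minimal sketch of that idea, not a required step. The address `https://cloud.r-project.org` is an assumption made for this example (it is the CRAN entry that redirects you to a nearby server); any mirror from the CRAN list could be used instead.

```r
# Point this R session at a CRAN mirror.
# Assumption for this sketch: the cloud address, which redirects to a nearby server.
options(repos = c(CRAN = "https://cloud.r-project.org"))

# Confirm which mirror this session will use for package downloads.
cran_mirror <- getOption("repos")[["CRAN"]]
print(cran_mirror)
```

This setting only lasts for the current R session. RStudio also lets you pick a mirror in its Global Options, so you may never need to do this by hand.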
At this point, you might be wondering what a package is…
A package is a set of functions and/or data that you can download to
upgrade and add features to R. It is like a downloadable upgrade to a
Tesla vehicle that lets you play the video game Witcher 3 on your
console, but more useful.
Another useful analogy for packages is that they are like apps for a smartphone. When you buy your first smartphone, it comes with only the basic apps that allow it to work as a phone, plus a notepad and a calculator.
If you want to do cool things with your smartphone, you download apps that allow your smartphone to have new capabilities. That is what packages do for your installation of R.
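In code, the app analogy comes down to two steps: install a package once per computer, then load it in each session where you want to use it. The sketch below is only an illustration. The package name `dplyr` is an example chosen here, and the install line is commented out because it needs an internet connection.

```r
# Installing is done once per computer, like downloading an app.
# "dplyr" is only an example name; the line is commented out for this sketch.
# install.packages("dplyr")

# List the packages that are already installed on this computer.
installed <- rownames(installed.packages())
print(head(installed))

# Packages that ship with R itself, such as stats, are always present.
has_stats <- "stats" %in% installed
print(has_stats)

# Loading is done once per R session, like opening an app.
library(stats)
```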
Now let’s get started. Click on the blue link that says “download R”.
This will take you to a page to select your local CRAN Mirror, from which you will download R.
Scroll down to your own country (yes, the USA is at the bottom), and pick a CRAN mirror near you. This is an example from the state of Michigan, in the USA.
Once you click on a CRAN Mirror site to select the location, you will be taken to the actual Download site.
Select the link for the operating system you want to use. We will walk through this with Windows first, then Mac. If you are using a Mac, skip forward to the Mac install directions. If you are computer-savvy enough to be using Linux, you can clearly figure it out on your own (it will look a lot like these).
If you are installing R on a Mac, jump ahead to the Mac-specific version below.
On Windows, once you have clicked through, your next screen will look like this:
You want to download both base and Rtools (you might need Rtools later). The base link will take you to the latest version, which will look something like this.
Click on this link, and you will be able to save a file named R-N.N.N-win.exe (Ns depending on version number) to your Downloads folder. Click on the Save button to save it.
Now, go to your Downloads folder in Windows, and double click on the R installation file (R-N.N.N-win.exe). Click Yes to allow this to install.
Now select your language option.
You will be asked to accept the GNU license - do so. Click Yes to allow this to install. Then select where to install - generally use the default, a local (often C:) drive - do not install on a shared network drive or in the cloud.
Then select the Components - generally use the defaults, but newer computers can skip the 32-bit version.
In the next dialog box, accept the default startup options.
You can choose the start menu folder. The default R folder is fine.
If you want a shortcut icon for R on your desktop, you can leave this checked. But most people start RStudio, with R running within RStudio, rather than starting R directly. You probably won’t need an R shortcut, so you can leave these unchecked in the next dialog box.
Then the Setup Wizard will appear - click Finish, and the rest of the installation will occur.
Now you want to test whether your Windows installation was successful. Can you find R and make it work? Hunt for your C folder, then for OS-APPS within that folder. Keep drilling down to the Program Files folder. Then the R folder, and the current version folder within that one (R-N.N.N). Within that folder will be the bin folder, and within that will be your R-N.N.N.exe file. Double click on this to run it. The example paths below can help guide you.
Opening the exe file will produce a classic 2000-era terminal window, called Rterm, with 64 bit if that is what your computer uses. The version number should match what you downloaded. The messaging should end with a “>” prompt.
At this prompt, type in:
paste("Two to the seventh power is", 2^7)
(don’t leave out the comma or the quotes) - then press the Enter key.
This should produce the following:
[1] "Two to the seventh power is 128"
Note that you have explained what is being done in the text, and computed the result and displayed it.
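If you want one more check from the same prompt, the short sketch below prints the version of R that is running and then repeats the test. It uses only functions that come with R; the object names `r_version` and `msg` are made up for this example.

```r
# Show which version of R is running (this should match your download).
r_version <- R.version.string
print(r_version)

# Combine text with a computed value, as in the test above.
msg <- paste("Two to the seventh power is", 2^7)
print(msg)
```

Saving the result in an object with `<-` before printing it is a habit you will use constantly later on, because it lets you reuse the result in the next step.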
The installation for Mac is very similar, but the windows look a bit different. If you are working with Windows, jump ahead at this point to Installing RStudio. At the Download Version page, you click on the Mac Download. You will then click on the link for R-N.N.N.pkg, and allow downloads from CRAN.
Then go to Finder, and navigate to the Downloads folder. Click on R-N.N.N.pkg.
Click on Continue on 2 consecutive screens to download. Then you need to agree to the License Agreement, then click on Install, and provide your Mac password for permission to install.
When the installation is complete, click on the Close button. Accept the prompt to move the installer file to the trash.
Go to Finder, and then your Applications folder. Scroll down to the R file. Double click on this to run it.
You should get this 2000-era terminal window named R Console. The version number should match what you downloaded, and the messaging should end with a “>” prompt. At this prompt, type in
paste("Two to the seventh power is", 2^7)
(DON’T leave out the comma or the quotes)
This should result in:
[1] "Two to the seventh power is 128"
Awesome. You are now Ready to R!
Now that R is working, we will install RStudio. This is an IDE (Integrated Development Environment), with lots of bells and whistles to help you do reproducible medical research.
This is a lot like adding a dashboard with polished walnut panels, a large video screen map, and heated car seats with Corinthian Leather. Not absolutely necessary, but nice to have.
The RStudio IDE wraps around the R engine to make your experience more comfortable and efficient.
Fortunately, RStudio is a lot cheaper than any of these cars. In fact, it is free and open source. You can download it from the web at:
Click on the RStudio Desktop icon to begin.
This will take you to a new site, where you will select the Open Source Edition of RStudio Desktop
This will take you to a new site, where you will select the Free Version of RStudio Desktop
Now select the right version for your operating system - Windows or Mac.
If you are installing on a Mac, jump ahead now to the Mac-specific installation instructions.
Now save the RStudio.N.N.N.exe file (Ns will be digits representing the version number) to your downloads folder.
Now go to your downloads folder, and double click on the RStudio.N.N.N.exe file.
Allow this app to make changes. Click Next to Continue, and Agree to the Install Location.
Click Install to put RStudio in the default Start Menu Folder, and when done, click the Finish button.
Now select your preferred language option, accept the GNU license, and click Yes to allow this to install. Select where to install. This is generally on a local (often C:) drive, and usually not a shared network drive or in the cloud.
Now you should be ready to test your Windows installation of RStudio.
Open your Start menu Program list, and find RStudio.
Pin it as a favorite now.
Click to Open RStudio.
Within the Console window of RStudio, an instance of R is started up. Check that the version number matches the version of R that you downloaded.
Now run a test at the prompt (“>”) in the Console window. Type in
paste("Three to the 5th power is", 3^5)
do not leave out the quotes or the comma
Then press the enter key
and this should be your result:
[1] "Three to the 5th power is 243"
A successful result means that you are ready to roll in RStudio and R!
Start at this link: RStudio Download
Select the Free RStudio Desktop Version
Then click on the big button to Download RStudio for Mac.
After the Download is complete, go to Finder and the Downloads Folder. Double click on the RStudio.N.N.N.dmg file in your Downloads folder.
This will open a window that looks like this
Use your mouse to drag the RStudio icon into the Applications folder.
Now go back to Finder, then into the Applications folder. Double click on the RStudio icon, and click OK to Open.
Pin your RStudio to the Dock.
Double Click to run RStudio.
RStudio will open an instance of R inside the Console pane of RStudio with the version number of R that you installed, and a “>” prompt.
Type in
paste("Three to the 5th power is", 3^5)
do not leave out the quotes or the comma
Then press the enter key
and this should be your result:
[1] "Three to the 5th power is 243"
A successful result means that you are ready to roll in RStudio and R!
You now have 6+ adjustments that you need to make in your RStudio Global Settings for optimal R and RStudio use.
Once this Rcode folder is in place, switch back to RStudio. In the RStudio Menus, go to Tools/Global Options. A new Global Options window will open up. Click on the General tab on the left. At the top, there is a small window for identifying your Default working directory. Click on the Browse button, and browse to your new “Rcode” folder and select it. From now on, your R files and Projects will all be in one place and easy to find.
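You can confirm the setting from the Console. This is a minimal sketch that assumes your “Rcode” folder sits directly inside your home directory; if you created it somewhere else, change the path in `rcode_dir` (a name made up for this example) to match.

```r
# Assumed location for this sketch: a folder named "Rcode" in your home directory.
rcode_dir <- file.path(path.expand("~"), "Rcode")

# TRUE if the folder exists, FALSE if it has not been created there.
rcode_exists <- dir.exists(rcode_dir)
print(rcode_exists)

# The folder that R is currently working in.
current_dir <- getwd()
print(current_dir)
```

After you restart RStudio with the new default, `getwd()` should report the folder you selected in Global Options.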
These tune-ups (#2 and #3) to your RStudio will mean you will always start with a clean workspace in a new RStudio session, which will avoid a lot of potential problems later.
This will put your temporary output from Code Chunks into the larger and nicer Viewer tab.
Take a look at the Appearance tab. You can change your code font, the font size, and the theme. I wouldn’t make any drastic changes at this point, but it is good to know that these options are available. Any changes here are entirely optional (and cosmetic) at this point.
In the RStudio menus, select Code, then check/select two options to turn these on:
Now your RStudio installation is tuned and ready to go!
The software program, git, is a version control system. It is the most common version control system in the world. It is free and open source, and is the foundation of reproducible computing.
We won’t be doing a lot with git just yet, but it is helpful to get this installation done and out of the way. It will come up a lot when we start to discuss reproducible research and collaboration.
If you are using Windows, jump ahead to Installing Git on Windows.
At that prompt, type git --version
note that there are 2 dashes before version.
This will tell you the current version of git (2.29.2 as of January 1, 2021), or prompt you to install git.
a. First, let’s check if you have homebrew installed.
Go to the Terminal tab in the Console pane (lower left) in RStudio. A
prompt will appear that ends in a $.
at the prompt, type command -v brew
This should return “/usr/local/bin/brew” (or “/opt/homebrew/bin/brew” on Macs with Apple silicon) if homebrew is installed, or will return nothing if it is not.
b. Installing homebrew
At the terminal prompt($), paste in the following:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
Then press Enter to run it. This installs the homebrew program, which allows you to install software on macOS that does not come from the Apple App Store.
This will take a couple of minutes.
c. Installing git
Once you have homebrew installed, installing git is straightforward.
At the Terminal prompt ($), type
brew install git
and this will quickly install. You will be prompted to click Continue buttons to complete the installation.
Check your installation.
At the Terminal prompt ($), type
git --version
and this should return a result like “git version 2.29.2”, depending on the version number.
If you are using Windows, go to the website, https://git-scm.com/download/win.
This will start the download automatically
Go to your downloads folder and install the downloaded .exe file by clicking on it
Check your installation.
At the Terminal prompt ($), type
git --version
and this should return a result like “git version 2.29.2”, depending on the version number.
If you are using Fedora, or a related version of Linux like RHEL or CentOS, use dnf
At the $ prompt, type sudo dnf install git-all
If you are using a Debian-based version of Linux like Ubuntu, use apt
At the $ prompt, type sudo apt install git-all
For other distributions of Linux, follow the instructions at https://git-scm.com/download/linux.
Check your installation.
At the Terminal prompt ($), type
git --version
and this should return a result like “git version 2.29.2”, depending on the version number.
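On any of these operating systems, you can run the same check from the R Console instead of the Terminal. The sketch below uses only functions that come with R. The object names are made up for this example, and the exact version it prints will depend on your installation.

```r
# Look for git on the system PATH; the result is "" if git is not found.
git_path <- Sys.which("git")
print(git_path)

if (nzchar(git_path)) {
  # Run `git --version` and capture what it prints.
  git_version <- system2("git", "--version", stdout = TRUE)
  print(git_version)
} else {
  message("git was not found; revisit the installation steps above.")
}
```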
When you first open the RStudio IDE (Integrated Development Environment), there will be a left side pane, with tabs for Console, Terminal, Rmarkdown, and Jobs.
Just for fun, go to the RStudio menus, and choose File/New File/R Script. This will open a new pane at the top left, which we will call Q1 (quadrant 1), or the top left pane, or the Source pane. This pane will contain tabs for each active script or document, along with tabs for any datasets you have opened up to have a look at.
The Quadrant 2 pane, with tabs for Console, Terminal, Rmarkdown, and Jobs, has now been pushed to the lower left pane. You will use the Console for interactive programming, and as a “sandbox” to test out new code. When your code works and is good enough to save, you will move it to the Source pane and save it to a Script or an Rmarkdown document. Any code that is not saved to Source will be lost (actually it will be somewhere in the History, but it can be a pain to find the version that works later - it is best to save the good stuff to a Script or .Rmd).
The Quadrant 3, or top right pane, includes tabs for your Environment (objects, like datasets, functions, and variables you have defined), History (saving the past in case you forget to, but messy), and Connections tabs for connections to databases. Later a Git tab will be added for version control (backup) of your Source documents.
In Quadrant 4, or the bottom right pane, you will find tabs for your Files, Plots, Packages, Help, and a Viewer for HTML output.
This is material that is also well described in the “Basic Basics 1” section of RLadiesSydney. Check it out at BasicBasics1. There is a nice ~ 15 minute video by Jen Richmond worth watching if you are just getting started. Note that a lot of the other material on this website (RYouWithMe) is very helpful for people new to R.
In this chapter, we will introduce you to ways to wrangle rows in R. You will often want to focus your analysis on particular observations, or rows, in your dataset. This chapter will show you how to include the rows you want, and exclude the rows you don’t want. Once your data wrangling and data validation are done, you will be ready for data analysis.
If you have not used a flipbook before, you can click on the frame below to activate it, then use right and left arrow keys to move forward and back through the demo.
With each forward step in the code on the left, examine the resulting output on the right. Make sure you understand how the output was produced.
You can use multiple filters on your data, and combine these with AND, OR, XOR, parentheses, and combinations thereof.
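As a rough sketch of combining conditions, here is a tiny made-up dataset filtered three ways. It is written in base R so that it runs without any packages, with the dplyr `filter()` equivalents shown in comments. The data frame `patients` and its column names are invented for this example.

```r
# Hypothetical toy data; the column names are made up for illustration.
patients <- data.frame(
  id       = 1:6,
  age      = c(34, 67, 71, 45, 80, 59),
  sex      = c("F", "M", "F", "M", "F", "M"),
  diabetic = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# AND (&): women over 60.
# dplyr equivalent: filter(patients, age > 60 & sex == "F")
older_women <- patients[patients$age > 60 & patients$sex == "F", ]

# OR (|) with parentheses: over 70, or (male and diabetic).
# dplyr equivalent: filter(patients, age > 70 | (sex == "M" & diabetic))
flagged <- patients[patients$age > 70 |
                      (patients$sex == "M" & patients$diabetic), ]

# XOR: exactly one of the two conditions is TRUE, not both.
# dplyr equivalent: filter(patients, xor(age > 60, diabetic))
one_only <- patients[xor(patients$age > 60, patients$diabetic), ]

print(older_women$id)  # 3 5
print(flagged$id)      # 3 4 5
print(one_only$id)     # 1 2 4 5
```

Patient 3 is over 60 and diabetic, so she passes the AND and OR filters but is dropped by XOR, which keeps only rows where exactly one condition holds.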
You can use == to test exact equality of strings, but you can also use str_detect from the {stringr} package, and combine it with the magic of regex to do complicated filtering on character string variables in datasets.
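To show the difference between exact matching and pattern matching, the sketch below uses base R's `grepl()`, which plays the same role as `str_detect()` but needs no package. The stringr version is shown in a comment. The data frame `visits` is invented for this example.

```r
# Hypothetical toy data with inconsistent capitalization.
visits <- data.frame(
  id = 1:5,
  diagnosis = c("ulcerative colitis", "Crohn's disease", "Ulcerative Colitis",
                "microscopic colitis", "celiac disease"),
  stringsAsFactors = FALSE
)

# Exact equality matches only an identical string (and is case-sensitive).
exact <- visits[visits$diagnosis == "ulcerative colitis", ]

# A regular expression matches a pattern: any diagnosis containing "colitis",
# ignoring case. stringr equivalent:
# filter(visits, str_detect(diagnosis, regex("colitis", ignore_case = TRUE)))
colitis <- visits[grepl("colitis", visits$diagnosis, ignore.case = TRUE), ]

# The anchor "^" requires the string to START with the pattern.
starts_uc <- visits[grepl("^ulcerative", visits$diagnosis, ignore.case = TRUE), ]

print(exact$id)      # 1
print(colitis$id)    # 1 3 4
print(starts_uc$id)  # 1 3
```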
You can use the {lubridate} package to format strings for logical tests, and filter your observations by date, month, year, etc.
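The idea is to turn text into real dates first, and only then filter. The sketch below does this with base R's `as.Date()` and `format()` so that it runs without packages; the lubridate equivalents are noted in comments. The data frame `admissions` is invented for this example.

```r
# Hypothetical toy data: admission dates stored as text.
admissions <- data.frame(
  id    = 1:5,
  admit = c("2020-01-15", "2020-03-02", "2021-03-20", "2021-07-04", "2022-03-11"),
  stringsAsFactors = FALSE
)

# Convert text to real dates. lubridate equivalent: ymd(admit)
admissions$admit_date <- as.Date(admissions$admit, format = "%Y-%m-%d")

# Filter by a date cutoff: admitted on or after January 1, 2021.
since_2021 <- admissions[admissions$admit_date >= as.Date("2021-01-01"), ]

# Filter by month (March, in any year).
# lubridate equivalent: month(admit_date) == 3
march <- admissions[format(admissions$admit_date, "%m") == "03", ]

# Filter by year. lubridate equivalent: year(admit_date) == 2020
in_2020 <- admissions[format(admissions$admit_date, "%Y") == "2020", ]

print(since_2021$id)  # 3 4 5
print(march$id)       # 2 3 5
print(in_2020$id)     # 1 2
```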
You can use the is.na(), drop_na() and negation with ! to help identify and filter out (or in) the missing data, or observations that are incomplete.
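Here is a small sketch of finding and removing incomplete observations. It uses base R (`is.na()`, `!`, and `complete.cases()`), with the dplyr and tidyr equivalents in comments. The data frame `labs` and its values are invented for this example.

```r
# Hypothetical toy data with some missing lab values.
labs <- data.frame(
  id  = 1:5,
  crp = c(12.1, NA, 3.4, NA, 8.8),
  hgb = c(13.2, 11.8, NA, 12.5, 14.0)
)

# Rows where crp is missing. dplyr equivalent: filter(labs, is.na(crp))
missing_crp <- labs[is.na(labs$crp), ]

# Negate with ! to keep rows where crp is present.
# dplyr equivalent: filter(labs, !is.na(crp))
has_crp <- labs[!is.na(labs$crp), ]

# Keep only rows with no missing values in any column.
# tidyr equivalent: drop_na(labs)
complete <- labs[complete.cases(labs), ]

print(missing_crp$id)  # 2 4
print(has_crp$id)      # 1 3 5
print(complete$id)     # 1 5
```

Note that `labs$crp == NA` would not work as a test, because any comparison with NA returns NA; that is why `is.na()` exists.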
You can use the {janitor} package to help you find duplicated observations/rows for fixing or removal from your dataset.
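The sketch below shows the underlying idea with base R's `duplicated()`, so that it runs without packages. The janitor and dplyr functions that do the same jobs are named in comments. The data frame `enrolled` is invented for this example.

```r
# Hypothetical toy data: one patient was entered twice by mistake.
enrolled <- data.frame(
  mrn   = c(101, 102, 102, 103, 104, 104),
  visit = c("baseline", "baseline", "baseline", "baseline", "baseline", "week 4"),
  stringsAsFactors = FALSE
)

# Flag every row that is an exact copy of another row (both copies).
# janitor equivalent: get_dupes(enrolled)
is_dupe <- duplicated(enrolled) | duplicated(enrolled, fromLast = TRUE)
dupes <- enrolled[is_dupe, ]
print(dupes)

# Keep only the first copy of each row. dplyr equivalent: distinct(enrolled)
deduped <- enrolled[!duplicated(enrolled), ]
print(nrow(deduped))  # 5
```

Patient 104 appears twice but at different visits, so those rows are not duplicates. Always look at the flagged rows before removing anything.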
You can use the slice() family of functions to cut out a chunk of your observations/rows by position in the dataset.
You can use the slice_sample() function to take a random subset of large datasets, or to build randomly selected training and testing sets for modeling.
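Selecting rows by position and by random draw can be sketched in base R as shown below, with the matching dplyr `slice()` functions in comments. The data frame `trial`, the 75/25 split, and the seed value are all choices made for this example.

```r
# Hypothetical toy data: 20 observations.
trial <- data.frame(id = 1:20, value = seq(10, 200, by = 10))

# Rows by position. dplyr equivalent: slice(trial, 5:8)
chunk <- trial[5:8, ]

# First and last rows.
# dplyr equivalents: slice_head(trial, n = 3) and slice_tail(trial, n = 3)
first3 <- head(trial, 3)
last3  <- tail(trial, 3)

# Random 75/25 split into training and testing sets.
# set.seed() makes the random draw reproducible.
# dplyr equivalent for the draw: slice_sample(trial, prop = 0.75)
set.seed(42)
train_rows <- sample(nrow(trial), size = 0.75 * nrow(trial))
training <- trial[train_rows, ]
testing  <- trial[-train_rows, ]

print(chunk$id)        # 5 6 7 8
print(nrow(training))  # 15
print(nrow(testing))   # 5
```

Setting the seed matters for reproducible research: without it, every run would produce a different training set, and your results could not be reproduced exactly.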
Especially when you are starting out, it can be very difficult to interpret error messages, because these can be quite jargon-y.
Let’s start with a table of the most common error messages, and the likely cause in each case.
Note that when reading an error message, there are two parts - the part before the colon, which identifies in which function the error occurred, and the part after the colon, which names the error. A typical error message is usually in the format:
Error in Where the error occurred : what the error was
here is an example
Error in as_flextable(.) : object 'errors' not found
On the left, you are being told that the error occurred when the as_flextable() function was called. This can be helpful if you have run a long pipeline of functions, as it helps you isolate the problem.
On the right, you are being told what the error was. In this case, the function looked for the object errors in the working environment (see your Environment tab at the top right in RStudio), and could not find it.
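You can pull these two parts apart yourself. The sketch below deliberately causes an "object not found" error and catches it with `tryCatch()`, so that the code keeps running. The name `errors_table` is hypothetical; it is chosen precisely because no such object exists.

```r
# Hypothetical name: no object called errors_table has been created.
err <- tryCatch(
  nrow(errors_table),
  error = function(e) e  # return the error object instead of stopping
)

# The part before the colon: the call in which the error was raised.
print(conditionCall(err))

# The part after the colon: what the error was.
err_msg <- conditionMessage(err)
print(err_msg)
```

You will not need `tryCatch()` for everyday work; it is used here only so that the error can be inspected instead of halting the code.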
Note that sometimes syntax errors caused by missing components (a missing comma, a missing parenthesis, a missing pipe symbol %>% , or a missing + sign in a ggplot pipe) will cause an error in the next function in the pipeline. Watch out for this, especially when the function where the error is found looks fine - often it occurs because there is a missing piece just before this function.
Then we will walk through examples of how to create each error, and how to fix them, one by one.
Examine the error message from R, particularly the part that comes after the colon (:). The error messages listed in the left column will be what appears after the colon (:)
| Error Message | What it Means |
|---|---|
| could not find function | This usually means that you made a typographical error in the function name (including capitalization - R is case-sensitive), or that the package you are intending to use (which contains the function) is not installed (fix this with `install.packages("package_name")`) or not loaded (fix this with `library(package_name)`). |
| object ‘object-name’ not found | This usually means that the function looked for an object (like a data frame or a vector) in your working environment (check your Environment pane) and could not find it. This commonly happens when you … |
| filename does not exist in current working directory (‘path/to/working/directory’) | This usually means one of three things: (1) you mistyped the name of the file, or part of the path, |
| error in if | This usually means that you have an *if* statement that is trying to make a branch-point decision, but the logical statement that you wrote is not providing either a TRUE or a FALSE value. The most common reasons are typographical errors in the logical statement, or an NA in one of the underlying values, which yields an NA from the logical statement. You may need to handle missing values before the test, for example by checking with `is.na()`, or by using an `na.rm = TRUE` option in a summary function (like `mean()`) that feeds the logical statement. |
| error in eval | This usually occurs when you are trying to run a function on an object that does not exist in your environment. Check to make sure in your Environment pane, and consider that you may not have saved/assigned the object. Alternatively, you may have a typographical error in the object name. Worth checking. |
| cannot open | This usually occurs when you are attempting to access or read a file that either does not exist, or is not in the folder that you thought it was. Check your working directory and find the file in your file structure. This can often be prevented by working in RStudio projects and using the here() function for paths to files. |
| no applicable method | This usually occurs when you are using a function that expects a particular data structure (vector, list, dataframe), but you have given it a different data structure as the input. Check the data structure of your object, and check the documentation for your function. For example, if you want to use a function that acts on vectors, this function will not work on a dataframe variable. You may have to use the `pull(var)` function to “pull” this variable out of the dataframe into a vector before using this function. |
| subscript out of bounds | This occurs when you are trying to access an item in an environment object (like a vector, matrix, or list) that does not exist, like the 9th item in a vector that is 7 items long, or a row beyond the last row of a matrix. Check the length of the item, and the math that you used to count the item number (loops that go too long are often a culprit). |
| replacement has [x] rows, data has [y] rows | This usually occurs when you are trying to code for a new variable, or replace a variable in a dataframe. But somehow (missing values, NAs), what you are trying to add to the dataframe is not the same length (number of rows) as the rest of the existing dataframe. Use a length() function to check your building of this vector at each step, to figure out where your length went wrong. |
| package not available for R version x.y.z | This occurs when you are trying to install a package, and your R version is newly updated. The problem is that the package version available on CRAN has not caught up to your shiny new version of R. This can happen after an R update when the package developer is working on updating their package, but the new version has not made it onto CRAN yet. This is often fixable if you know where the developer stores their development code (usually on GitHub). For example, if the package is {medicaldata}, and the developer’s GitHub userid is higgi13425, then you can install the development version of this package with `remotes::install_github('higgi13425/medicaldata')`. This assumes that you have already installed the {remotes} package. |
| non-numeric argument to binary operator | A binary operator, like + or *, is a mathematical operation that takes two values (operands) and produces another value. It gets grumpy when trying to do math on things that are not numbers. A typical input to produce this error would be `1 + 'one'` - one operand is numeric, and the other, 'one', is a character string - the non-numeric argument. |
| object of type closure is not subsettable | This occurs when you try to extract a subset of something - but it is actually a function, not an object. This most commonly occurs when you try to subset a particular object that does not exist, like df$patient_id or data$sbp, when you have not created the objects df or data. The reason you get this strange error message, rather than simply Error: object 'df' not found , is that df() and data() are defined functions in base R. It is good practice to avoid naming any objects data or df for this reason. It gets very confusing, and this is best avoided. |
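
Several rows of this table can be reproduced safely in base R. The sketch below is an illustration under stated assumptions, not part of the original text: catch_error() is a hypothetical helper written for this example, the values are made up, and the third example assumes that you have not created an object named df in your session.

```r
# Hypothetical helper: run an expression and return the error object
# (or NULL if the expression runs without an error)
catch_error <- function(expr) {
  tryCatch({ expr; NULL }, error = function(e) e)
}

# 1. "error in if": the logical test yields NA instead of TRUE or FALSE
sbp <- NA_real_
err_if <- catch_error(if (sbp > 140) "high" else "normal")
# isTRUE() treats NA as "not TRUE", so the branch can be decided
bp_label <- if (isTRUE(sbp > 140)) "high" else "not confirmed high"

# 2. "subscript out of bounds": asking for the 9th item of a 7-item vector
seven_items <- c(10, 20, 30, 40, 50, 60, 70)
err_bounds <- catch_error(seven_items[[9]])

# 3. "object of type 'closure' is not subsettable":
# no object named df exists here, so R finds the function df() instead
err_closure <- catch_error(df$patient_id)

# Print the part of each message that would appear after the colon
for (e in list(err_if, err_bounds, err_closure)) {
  print(conditionMessage(e))
}
```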
This is a very common error. It is easy to lose track of how many sets of parentheses you have open in putting together a complicated function.
Here is an example, where a closing parenthesis is missing from a mutate() function.
prostate %>%
  select(t_vol, p_vol, age, aa) %>%
  mutate(ratio = t_vol / p_vol,
         older_aa = case_when(age > 65 & aa == 1 ~ 1,
                              TRUE ~ 0) %>%
  filter(older_aa == 1)
In this case, no output is produced, and the console does not return to the > prompt. Instead, it offers a + prompt - in effect, asking you for something more. If you type in an extra closing parenthesis (after the filter function), it will give you an error.
The error you get is:
Error: Problem with `mutate()` input `older_aa`.
x no applicable method for ‘filter_’ applied to an object of class “c(‘double’, ‘numeric’)”
ℹ Input `older_aa` is ``%>%`(…)`.
R identifies a problem with the input “older_aa” to mutate - the parentheses are not closed.
It then fails on the next function - filter, and gives you a strange error message - filter_ applied to… - because the input to the filter step (the next step after the error) was incoherent. This can be a bit confusing. But if you inspect the input older_aa, you will find the mis-matched parentheses. This is much easier to find with “rainbow parentheses” turned on in Tools/Global Options. When this option is on, you can be sure your parentheses are right when you end on red.
In this case, adding the missing parenthesis to the mutate step fixes it.
Parentheses that end on red are all right.
What if you go the other way, with an extra parenthesis after some misguided copy-paste adventures? Let’s see what happens.
prostate %>%
  select(t_vol, p_vol, age, aa) %>%
  mutate(ratio = t_vol / p_vol,
         older_aa = case_when(age > 65 & aa == 1 ~ 1,
                              TRUE ~ 0))) %>%
  filter(older_aa == 1)
In this code block, you will end up with two red closing parentheses, and when you click to the right of the final closing parenthesis, there will be no matching highlighted open parenthesis (note that the preceding closing parentheses both have matching highlighted open parentheses). Both of these are clues that this last one is an extra.
The error you get from R is
Error in filter(older_aa == 1) : object ‘older_aa’ not found
The left side of the error message identifies the filter step as where the error occurs, and the right side of the error message states that the error is an object not found. The error occurs when R gets to the next function. It also tells you that older_aa was not successfully created - suggesting that the problem is in the step before the filter function.
In this case, removing the extra parenthesis from the mutate step fixes it.
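In addition to rainbow parentheses, you can ask R whether a piece of code is syntactically complete before you run it. The sketch below is one possible approach in base R, not something from the original text: parses_ok() is a hypothetical helper that wraps parse() in tryCatch(), and the three code strings are made-up examples of balanced, missing, and extra parentheses.

```r
# Hypothetical helper: TRUE if the code text can be parsed, FALSE otherwise
parses_ok <- function(code) {
  tryCatch({
    parse(text = code)   # parse only - the code is NOT run
    TRUE
  }, error = function(e) FALSE)
}

balanced      <- "mean(c(1, 2, 3))"
missing_paren <- "mean(c(1, 2, 3)"     # one closing parenthesis short
extra_paren   <- "mean(c(1, 2, 3)))"   # one closing parenthesis too many

parses_ok(balanced)
parses_ok(missing_paren)
parses_ok(extra_paren)
```

A parse check only tells you that something is unbalanced or incomplete, not where. You still need to inspect the code (or use the rainbow colors) to find the spot.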
Missing %>% in a data wrangling pipeline
This is a common error. It is easy to cut out one of your %>% connectors when you are editing/debugging a data wrangling pipeline.
Here is an example, where a %>% is missing. Can you spot it?
prostate %>%
  select(t_vol, p_vol, age, aa)
  mutate(ratio = t_vol / p_vol,
         older_aa = case_when(age > 65 & aa == 1 ~ 1,
                              TRUE ~ 0)) %>%
  filter(older_aa == 1)
In this case, the error you get is:
Error in mutate(ratio = t_vol/p_vol, older_aa = case_when(age > 65 & aa == : object ‘t_vol’ not found
The left side of the error message identifies the mutate step as where the error occurs, and the right side of the error message states that the error is an object not found. This is a bit misleading, as the problem is not in the mutate step. But mutate is where the pipeline crashes, as it cannot find the variable t_vol. You have to backtrack upwards line-by-line to find the error. Every line of a data wrangling pipeline (except the last) should end in %>%. Since this is such a common error, this should be one of your “usual suspects”. And the select line, just above the mutate line, is where the problem is.
In this case, adding the missing %>% to the end of the select step fixes your data wrangling pipeline.
Use one function per line in a pipeline.
Check every data wrangling pipeline to make sure each step (except the last) ends in a pipe %>%
This is a common error. It is easy to cut out one of your + connectors when you are editing/debugging a ggplot.
Here is an example, where a + is missing in the middle of a ggplot pipeline.
prostate %>%
  select(t_vol, p_vol, age, aa) %>%
  ggplot(aes(x = factor(t_vol), y = p_vol))
  geom_boxplot() +
  labs(x = "tumor volume", y = "prostate volume") +
  theme_minimal()
In this case, you get a ggplot output, but without any boxplots. It is also missing your custom labels for the x and y axes, and the theme you wanted. Essentially, the code stops running after the initial ggplot() statement and the remaining lines of code are ignored. This can be pretty puzzling, as you do get a plot, but not what you intended. There is a partial plot in the Plots tab, but you get a somewhat helpful error in the Console.
The error you get is:
Error: Cannot add ggproto objects together. Did you forget to add this object to a ggplot object?
R identifies a problem with the last 3 lines of code, starting with geom_boxplot() - it cannot add these ggproto objects (the components of a ggplot) to the existing plot. It asks, “Did you forget to add?” which should be a clue that there is a missing + sign between lines of ggplot code. The default theme and labels, and the absence of boxplots, suggest that these last 3 lines were not applied to the plot at all, and that the missing plus sign should be found just before these lines of code.
In this case, adding the missing + to the end of the ggplot step fixes your plot.
Use one function per line in a pipeline.
Check every ggplot pipeline to make sure each step (except the last) ends in a plus sign +
%>% in Place of a +
This is a common error. It is easy to start with your dataset, do some data wrangling steps with the pipe %>% and keep piping out of habit, even after you start your ggplot. Unfortunately, once you start to ggplot, you have to use + as your code connector. Having a pipe instead will cause an error.
Here is an example, where a %>% is used instead of + in a ggplot pipeline. It usually happens at the beginning of the ggplot, when you are still in piping mode.
prostate %>%
  select(t_vol, p_vol, age, aa) %>%
  ggplot(aes(x = factor(t_vol), y = p_vol)) %>%
  geom_boxplot() +
  labs(x = "tumor volume", y = "prostate volume") +
  theme_minimal()
In this case, you will not get a ggplot output, and you will get an error in the console.
The error you get is:
Error: `mapping` must be created by `aes()`
Did you use %>% instead of +?
The error message identifies the aes() step as where the error occurs. R identifies a problem that causes the aes function to fail to create a mapping. The first line is not very helpful (other than identifying aes() as a problem), but in the next line, R asks, “Did you use %>% instead of +?” which is very helpful. Once you know this, look at the line where aes() failed. This is where there is a pipe in place of a plus.
In this case, replacing the %>% with a + fixes your plot.
This is a common error. It is easy to start a series of arguments to a function, like multiple variables in a mutate step, and miss a comma between them.
Here is an example, where a comma is missing in a series of mutate steps. Note that it is a good habit to put one mutate step on each line, with each line ending in a comma. This will help you find the missing comma if (no, when) you make this mistake.
prostate %>%
  select(t_vol, p_vol, age, aa) %>%
  mutate(ratio = t_vol / p_vol,
         older_aa = case_when(age > 65 & aa == 1 ~ 1,
                              TRUE ~ 0)
         age_decade = floor(age / 10)) %>%
  filter(older_aa == 1)
In this case, you will not get a tibble output, and you will get an error in the console.
The error you get is:
Error in filter(older_aa == 1) : object ‘older_aa’ not found
The left side of the error message identifies the filter step as where the error occurs, and the right side of the error message states that the error is an object not found. R identifies a problem that causes the filter function to fail, but this is actually a problem in the line prior. The variable older_aa was not created and is not available to filter. It should have been created in the mutate step, but this step is where the failure occurred. Because you formatted the mutate step with one mutate statement per line, it is easy to check each line for a comma - and the older_aa line is missing its comma.
In this case, adding a comma at the end of the older_aa statement, right after “TRUE ~0)”, fixes your data wrangling pipeline.
This is a common error. You may have created or modified a dataframe, but forgot to assign it to a new object name. Or maybe you did this assignment in a different session, but have not done it in your current session. Or maybe you made a typographical error in calling the object (“covvid” instead of “covid”). Either way, this object is not yet loaded into your computing environment (the Environment tab).
In this example, we request data from the {medicaldata} package, but forget to assign it to an object.
So it does not exist when we try to use it to start a pipeline. This does not work.
medicaldata::covid_testing
covid %>%
  select(subject_id, age, result, ct_result, patient_class) %>%
  mutate(high_titer = case_when(ct_result < 18 ~ 1,
                                TRUE ~ 0),
         age_decade = floor(age / 10)) %>%
  filter(age > 50)
In this case, you will not get a tibble output, and you will get an error in the console.
The error you get is:
Error in select(., subject_id, age, result, ct_result, patient_class) : object ‘covid’ not found
The portion to the left of the colon identifies where the error occurs - in the select step. The portion to the right of the colon identifies the error. This one is easy. The object ‘covid’ was not found. You can check your Environment pane, and it will not be there. What the coder intended was to call medicaldata::covid_testing and assign it (with an arrow) to a new object named covid. But that assignment did not happen, and R is unable to guess what you meant.
In this case, adding an assignment arrow -> and the name covid to the end of the medicaldata::covid_testing line (so that it reads medicaldata::covid_testing -> covid) completes the assignment, creates the covid object, and fixes your data wrangling pipeline.
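The difference between calling a dataset and assigning it can be seen in a few lines of base R. In the minimal sketch below, covid_source is a tiny made-up data frame standing in for medicaldata::covid_testing (so that the example runs without the package), and covid_right is a hypothetical name used only to show the right-pointing arrow.

```r
# A tiny made-up data frame standing in for medicaldata::covid_testing
covid_source <- data.frame(subject_id = 1:3, age = c(34, 61, 58))

# Calling the data only prints it - no new object is created
covid_source

# Left-pointing arrow: the object name comes first
covid <- covid_source

# Right-pointing arrow: the object name comes last
covid_source -> covid_right

# Both objects now exist in the Environment and are identical
exists("covid")
identical(covid, covid_right)
```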
This is a very common error. The equals sign is commonly used in two ways in R.
It is very common to use one equals sign in a logical statement. This causes errors. Watch the last filter step below.
prostate %>%
  select(t_vol, p_vol, age, aa) %>%
  mutate(ratio = t_vol / p_vol,
         older_aa = case_when(age > 65 & aa == 1 ~ 1,
                              TRUE ~ 0),
         age_decade = floor(age / 10)) %>%
  filter(older_aa = 1)
In this case, the error you receive is very helpful:
Error: Problem with `filter()` input `..1`.
x Input `..1` is named.
ℹ This usually means that you’ve used `=` instead of `==`.
ℹ Did you mean `older_aa == 1`?
The problem is with the filter step. The error starts out very jargon-y. “input `..1`. x Input `..1` is named” - means the input to filter is actually named (an assignment). But then it gets a lot more helpful. It recognizes that you have made a common error, and suggests an appropriate fix.
In this case, adding a 2nd equals sign in the filter step fixes your data wrangling pipeline.
Testing for equality with == is a big problem with real numbers, rather than integers. Computers store real numbers with limited (floating-point) precision, so their math is not quite exact, leading to small differences in decimals. The == equality test is very strict, so that something like sqrt(2)^2 == 2 is FALSE because of small differences far to the right of the decimal point, which can trip you up. You can see these if you run the modulo 2: sqrt(2)^2 %% 2, which gives you the remainder after you divide by 2, which is the very tiny 0.0000000000000004440892. In this situation, you should use the near() function from {dplyr}, as near(sqrt(2)^2, 2) is TRUE. The near function has a built-in tolerance of 0.00000001490116, which will be able to handle any computer-generated small, stray decimals. You can set your own tolerance argument if needed.
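The same comparison can be worked through in base R, without loading any packages. The sketch below is an illustration only: the variable names are made up, and the tolerance is computed as the square root of the machine precision, which is the value quoted above as the default for near().

```r
x <- sqrt(2)^2

# Strict equality fails because of a tiny stray decimal
strict_equal <- x == 2
difference   <- x - 2

# Compare within a tolerance instead
tol <- sqrt(.Machine$double.eps)     # about 0.0000000149
within_tolerance <- abs(x - 2) < tol

# all.equal() is the base R function for "equal, within a tolerance"
base_equal <- isTRUE(all.equal(x, 2))

strict_equal
difference
within_tolerance
base_equal
```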
This happens when you try to do math on things that are not numbers. It usually occurs when you have a variable (column) that looks like it is numeric (it contains numbers), but somewhere along the way it became a character string variable. This often occurs when data are being entered into a spreadsheet, and one value in the column has characters in it. For example, you may have a column of systolic blood pressures, and one value is entered as “this was not done”, or “102, but taken standing up”. A comment like this, even in only one cell of a column in Excel, makes the whole column into the character string data type when it is read into R.
This is not apparent until you try to do math with this variable, as in
data %>%
  mutate(mean_art_pressure = sbp / 3 + 2 / 3 * dbp)
This will give you the error:
Error in mutate(mean_art_pressure: non-numeric argument to binary operator
To fix this, you will have to:
1. Determine which variable, sbp or dbp, is non-numeric (glimpse(data) will help).
2. Review the values of the problem variable (possibly with table()) to find which value is non-numeric.
3. Fix these values manually in your code, and document with comments:
   - Which values are being fixed (e.g. sbp for subject 007, at visit 2)
     data$sbp[data$subject == 007 & data$visit == 2] <- 102
   - What the original value was, and what the new value will be
   - Who made the change to the data
   - Why the data change was made
   - On what date the data change was made
4. Convert the repaired variable to numeric (with as.numeric()), as it will still be stored as a character string after the fix.
Never over-write your original data - keep a complete audit trail!
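The steps above can be sketched in base R as follows. This is a minimal illustration, not the author's own code: bp_raw is a small made-up data frame, the object names are hypothetical, and the audit-trail comments show the kind of documentation described above rather than a required format.

```r
# Made-up blood pressure data; one sbp entry contains a comment,
# so the whole column is stored as character strings
bp_raw <- data.frame(
  subject = c(5, 6, 7, 8),
  visit   = c(2, 2, 2, 2),
  sbp     = c("118", "131", "102, but taken standing up", "125"),
  dbp     = c(76, 84, 68, 80),
  stringsAsFactors = FALSE
)

# Which sbp values cannot be converted to a number?
not_numeric <- is.na(suppressWarnings(as.numeric(bp_raw$sbp)))
bp_raw[not_numeric, c("subject", "visit", "sbp")]

# Work on a copy, so that the original data are never over-written
bp_clean <- bp_raw

# Data fix: sbp for subject 7, visit 2
# Original value: "102, but taken standing up"; new value: 102
# Also record here: who made the change, why, and on what date
bp_clean$sbp[bp_clean$subject == 7 & bp_clean$visit == 2] <- "102"

# Convert the repaired column to numeric, then do the math
bp_clean$sbp <- as.numeric(bp_clean$sbp)
bp_clean$mean_art_pressure <- bp_clean$sbp / 3 + 2 / 3 * bp_clean$dbp
bp_clean
```

Because the fix lives in code rather than in the spreadsheet, the original file stays untouched and the change can be reviewed or reversed later.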
This is where the internet comes in handy. Whatever errors you can create, someone has already run into. And they have asked for help on the internet, and most of the time, someone has helped them solve their error.
You should copy your entire error message, and paste it into a web search. Google will often yield multiple similar examples, with various ways to solve the problem.
Remember that the error may have occurred because of a problem in the previous line of code (missing parenthesis, comma, etc.), so don’t forget to check one line above.
The Add-One-Line debugging strategy is a good place to start. Select the code for your pipeline from the beginning to 2 lines of code before the error. If that runs without errors, add one line to your selection, and run it. Keep adding lines to your selection and running until you hit the error. Then try to find the problem and fix it.
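Here is what the Add-One-Line strategy looks like in practice. To keep this sketch runnable without any packages, it makes two assumptions that differ from the examples above: it uses the built-in mtcars dataset rather than medical data, and the base R pipe |> (available in R 4.1 and later) with base functions in place of %>% and {dplyr}. The step names are made up; in real use you would simply select and run a growing portion of one pipeline.

```r
# Run 1: the first step only
step1 <- mtcars |>
  subset(select = c(mpg, cyl, wt))

# Run 2: add one line - does it still run without an error?
step2 <- mtcars |>
  subset(select = c(mpg, cyl, wt)) |>
  transform(mpg_per_wt = mpg / wt)

# Run 3: add the next line
step3 <- mtcars |>
  subset(select = c(mpg, cyl, wt)) |>
  transform(mpg_per_wt = mpg / wt) |>
  subset(cyl == 4)

# Check the result after each run; the first run that fails
# (or gives an unexpected result) points to the line just added
dim(step1)
dim(step2)
dim(step3)
```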
If you are running code that has worked before, and it is not working now, it is possible that you have created something odd in your working Environment that is interfering with your code. Sometimes it is an old object from a previous session (it is always better to start from a clean slate). Completely restart your R session (click on Session/Restart R, or use the keyboard shortcut), make sure the Environment is clean, then run your code from start to finish to give it a new try. Sometimes a clean slate will make all the difference.
https://bookdown.org/yih_huynh/Guide-to-R-Book/trouble.html
https://medium.com/analytics-vidhya/common-errors-in-r-and-debugging-techniques-f11af3f1c7d3
https://rpubs.com/Altruimetavasi/Troubleshooting-in-R
https://www.r-bloggers.com/2016/06/common-r-programming-errors-faced-by-beginners/
https://www.r-bloggers.com/2015/03/the-most-common-r-error-messages/
The most important way to extend R is to add packages. Each package adds new functions and/or data to R, enabling you to do much more in the R and RStudio environment.
When you open R, or start a new session, you have only the base version of R available, and it is pretty spartan. You can see how many packages you have available to you by starting RStudio and going to the menu Session/New Session, or Session/Restart R. Each of these will give you a clean workspace to start in. Once you have started a new session, or restarted R, run the following code:
print(.packages())
## [1] "medicaldata" "forcats" "stringr" "dplyr"
## [5] "purrr" "readr" "tidyr" "tibble"
## [9] "ggplot2" "tidyverse" "stats" "graphics"
## [13] "grDevices" "utils" "datasets" "methods"
## [17] "base"
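In a fresh session, you will find that you have only a handful of packages available, typically the seven that ship with R: base, utils, methods, stats, graphics, grDevices, and datasets. Your list may include a few more (such as devtools and usethis) if your startup profile loads them. The output shown above is longer because it came from a session in which {tidyverse} and {medicaldata} were already loaded.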
In order to use more of the power of R and RStudio, you will need to install packages (a one-time task), and load them (in each session) before use with a library(package_name) function.
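The difference between an installed package and a loaded package can be checked in code. The sketch below uses base R only; is_installed() and is_loaded() are hypothetical helper names written for this example, and "notARealPackage123" is a made-up package name used to show the negative case.

```r
# Installed = present in your R library on disk
is_installed <- function(pkg) nzchar(system.file(package = pkg))

# Loaded = attached in the current session with library()
is_loaded <- function(pkg) pkg %in% .packages()

is_installed("stats")                # ships with R
is_loaded("stats")                   # attached at startup
is_installed("notARealPackage123")   # made-up name, not installed
is_loaded("notARealPackage123")
```

You could run these helpers on "tidyverse" before and after calling library(tidyverse) to see the loaded status change while the installed status stays the same.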
If you Google a bit for ways to do things in R, you will find many packages that can be helpful. The most strictly validated packages are hosted on CRAN - a mirrored server. There are now over 20,000 packages on CRAN to do various specialized things in R. These were all useful for someone, so they have shared them on CRAN. To install packages from CRAN, you use the function:
install.packages("package_name")
Notice that the package_name has to be in quotes. These can be single or double quotes. The package_name and install.packages() are case-sensitive, like all objects and functions in R, so that something like Install.Packages will not work.
Once the package is installed, you keep it in your R library, which is associated with your current major version of R. You will need to update and reinstall packages each time you update to a new major version of R. R versions are designated in the format R version #.#.#. A change in the third number indicates a minor version change. A change in the first or 2nd number (from R 3.6.2 to 4.0.0, or 4.0.2 to 4.1.0) is a major version upgrade which will require re-installation of packages.
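You can check which version of R you are running, and where your package library lives, from the console. The short sketch below uses base R only; library_series is a made-up name for the first two numbers of the version, which on many installations also appear in the library folder path (this is an assumption about typical setups, so check your own .libPaths() output).

```r
# The full version, like "4.1.0"
full_version <- as.character(getRversion())

# The first two numbers, like "4.1" - the part that matters for re-installation
library_series <- paste(
  R.version$major,
  sub("\\..*$", "", R.version$minor),
  sep = "."
)

full_version
library_series

# The folder(s) where your installed packages are kept
.libPaths()
```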
Let’s practice installing a package. Run the code below to install the tidyverse package.
install.packages("tidyverse")
##
## The downloaded binary packages are in
## /var/folders/93/s18zkv2d4f556fxbjvb8yglc0000gp/T//RtmpWt7a0M/downloaded_packages
Some packages are still in development. These are often in repositories on GitHub, rather than on the CRAN servers. To install these packages, you need to know the path to the repository. You can install the medicaldata package from GitHub. Run the code below to install this package.
devtools::install_github("higgi13425/medicaldata")
## Using github PAT from envvar GITHUB_PAT
## Skipping install of 'medicaldata' from a github remote, the SHA1 (1c039d8b) has not changed since last install.
## Use `force = TRUE` to force installation
In contrast to install.packages(), the library() function can work with quotes around the package_name, but they are not required. This is because these packages are already installed in your R library, and are known quantities. In general, known objects in your R environment do not require quotes, and novel things like packages do require quotes.
If you re-run print(.packages()) at this point, you will not see any more packages than before. This is because you have installed new packages, but not loaded them.
Sometimes you may run into a problem installing a package which was developed for a previous version of R. Especially if you have recently upgraded your R version, the CRAN version of a package may be a bit behind. This can often be fixed by googling for “github” and “package_name”. This will usually lead you to the GitHub repository for that package, which will have a pathname of “github_username/package_name”. Once you know this, you can use `devtools::install_github('github_username/package_name')` to install the newest version of the package, which will usually be compatible with the latest version of R.
Some packages are dependent on specific versions of other packages, and will ask you to update the other packages during installation. As a general rule, you should say ‘yes’. If you are worried about over-writing an existing package in a way that would break your code in a different project, then that project needs its own project-specific library, which you can create with the {renv} package.
Sometimes packages require (depend upon) software that is not part of the R ecosystem. These will generally give you messages during the install process asking you to install this helper software. Common helper software includes things like Fortran and RJava. Sometimes you will need to go to websites, or use software like Homebrew (on the Mac) to install these extra helper pieces of software.
Run the code chunk below to load both {tidyverse} and {medicaldata}. Note that the {tidyverse} package is actually a meta-package that contains 8 packages, and each one has its own version number.
library(tidyverse)
library(medicaldata)
Notice that loading tidyverse led to some conflict messages. The dplyr::filter function masks the stats::filter() function. These two packages, {dplyr} and {stats}, both have a function named filter(). The more recently loaded package is assumed to be the default, so if you call a filter() command, R will use dplyr::filter(). If you want to call the stats::filter() command, you have to explicitly use the package::function() format. If you are not sure which package you loaded last, it can be wise to use the explicit format when calling functions in R.
The other masked function is lag(). The function dplyr::lag() is masking stats::lag(), as {dplyr} was loaded after {stats}. Most of the time this is not a big difference, but every once in a while a conflict between package functions can get very confusing. When in doubt, use the explicit format, in which you call package::function() to make clear what you mean - dplyr::lag() vs. stats::lag().
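You can ask R which package a bare function name currently points to, and you can always reach the masked version with the explicit format. The sketch below uses base R only, so it runs whether or not {dplyr} is loaded; the variable names and the five example numbers are made up for illustration.

```r
# Which package does a plain filter() call come from right now?
# "stats" in a fresh session, "dplyr" after library(tidyverse)
filter_home <- environmentName(environment(filter))
filter_home

# The explicit format always reaches the version you ask for.
# stats::filter() is a time series filter - here, a 3-point moving average
moving_avg <- stats::filter(c(1, 2, 3, 4, 5), rep(1 / 3, 3))
moving_avg
```

This makes the point of the conflict message concrete: the two functions named filter() do completely different jobs, so it matters which one you are calling.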
Note that it is good practice to load all of your packages needed for an R script or an Rmarkdown (.Rmd) document at the beginning of the script or .Rmd. This allows someone else using your script or Rmd to check whether they have the needed packages installed, and install them if needed. In an Rmarkdown document, this is done in a special setup code chunk near the top of the document. If some of these packages are not on CRAN, it is good practice to add a comment (a statement after a hashtag) on how to install this package. For example, in a setup chunk that loads {tidyverse} and {medicaldata}, it is a good idea to add a comment on how to install {medicaldata}, which is not yet on CRAN. See the example below
library(tidyverse)
library(medicaldata)
# the {medicaldata} package can be installed with devtools::install_github('higgi13425/medicaldata')
So you have imported your data! Great! Now to start the analysis!
Not so fast, cowboy!
First you need to validate your data.
It is much more exciting to make plots, to make interactive Shiny apps of your models to share on the web, and to knit your Markdown documents to Word or PDF.
But it turns out that most of the truly heinous, embarrassing errors in medical data analysis occur during the process of data wrangling.
Imagine being the star of some of these sordid tales.
After publishing a paper in JAMA in 2019, the authors shared their SAS code on GitHub, and an interested critic noticed that they listed data for 73,000 kidney transplants in the US in one year. But someone familiar with the UNOS data knew that there are about 280,000 kidney transplants per year. During a merge step between two databases, SAS silently over-wrote much of the original data. This discovery led to a retraction of the paper and re-analysis, demonstrating a much smaller effect size. (Gander JC, Zhang X, Ross K, et al. Association between dialysis facility ownership and access to kidney transplantation. Retracted and replaced April 21, 2020. JAMA. 2019;322(10):957-973.) Twitter thread link here https://twitter.com/eric_weinhandl/status/1253127109830156289?s=20
After publishing a report of a randomized controlled trial in COPD in JAMA in 2018 (Aboumatar H, Naquibiddin M, Chung S, et al. Effect of a Program Combining Transitional Care and Long-term Self-management Support on Outcomes of Hospitalized Patients With Chronic Obstructive Pulmonary Disease. A Randomized Clinical Trial. Retracted and replaced Nov 12, 2018. JAMA. 2018;320(22):2335-2343.), the authors realized that they had miscoded the treatment arms. For their logistic regression analysis, they had to recode the treatment arms from 1 and 2 to 0 and 1. Unfortunately, they flipped the values, and interpreted their results as beneficial. When they realized that the mis-coding changed the result from beneficial to harmful, they reported it to the journal and retracted the paper.
After publishing a report on Best Practices for In-Hospital Cardiac Arrest in JAMA Cardiology, the authors found coding errors in their data. 9 hospitals of 130 had been misclassified, changing some of their associations. (https://jamanetwork.com/journals/jama/article-abstract/2764714)
A med student analyzing a dataset for the first time uses boolean statements to categorize values. But she does not realize that this Stata dataset used “99” for missing values.
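The missing-value trap in the last tale is easy to reproduce. The sketch below is a made-up illustration in base R, not data from any real study: pain is a hypothetical 0-10 pain score in which 99 was used as the code for "missing", and the object names are invented for this example.

```r
# Hypothetical pain scores (0-10 scale), with 99 used as the code for missing
pain <- c(2, 8, 99, 5, 99, 7)

# Naive boolean categorization: the two 99s are counted as severe pain
severe_naive <- pain >= 7
sum(severe_naive)

# Recode 99 to NA first, then categorize
pain_clean <- replace(pain, pain == 99, NA)
severe_clean <- pain_clean >= 7
sum(severe_clean, na.rm = TRUE)

# A quick range check during validation would have revealed the problem
range(pain)
range(pain_clean, na.rm = TRUE)
```

Checking the range of every variable against its plausible values is one of the simplest validation steps, and it catches sentinel codes like 99 before they reach an analysis.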
Time series with data from influenza pandemic of 1918-19, perhaps.
Tidy forecasting
Feature extraction and Statistics
Rolling analysis with window functions.
Slider pkgdown page
In this chapter, we will focus on making the descriptive table of the participants in your study, often colloquially known as “Table One”, based on its usual placement in a medical manuscript.
Before we plunge in, I would like to make one point of warning. It is quite common in a multiple-arm randomized controlled trial to compare the distribution of particular baseline characteristics of the subjects between arms with a p value, usually in a column at the far right. This is silly, as this produces a whole column of p values, corresponding to the multiple comparisons performed. With 20 comparisons, by chance, you are likely to get one or more “significant” p values. These are not helpful or meaningful, and are considered bad statistical practice.
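The arithmetic behind this warning is short enough to check yourself. The sketch below assumes 20 independent comparisons, each tested at the conventional threshold of 0.05, with no true differences between arms; real baseline variables are often correlated, so treat the result as an approximation.

```r
alpha   <- 0.05   # significance threshold for each comparison
n_tests <- 20     # number of baseline characteristics compared

# Probability of at least one "significant" p value by chance alone
p_at_least_one <- 1 - (1 - alpha)^n_tests

# Expected number of "significant" p values by chance alone
expected_false_positives <- alpha * n_tests

round(p_at_least_one, 2)    # 0.64
expected_false_positives    # 1
```

Under these assumptions, a column of 20 baseline p values will contain at least one "significant" result about two times out of three, even when randomization worked perfectly.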
Let me quote the CONSORT guidelines on the publications of clinical trials.
“Unfortunately significance tests of baseline differences are still common; they were reported in half of 50 RCTs published in leading general journals in 1997. Such significance tests assess the probability that observed baseline differences could have occurred by chance; however, we already know that any differences are caused by chance. Tests of baseline differences are not necessarily wrong, just illogical. Such hypothesis testing is superfluous and can mislead investigators and their readers. Rather, comparisons at baseline should be based on consideration of the prognostic strength of the variables measured and the size of any chance imbalances that have occurred.” CONSORT STATEMENT
Despite this, some journals and editors still ask for these p values. Please resist, and quote the CONSORT statement. If you must do this, please do it only under duress.
This is a newer approach which offers many of the same features as tableby. The gtsummary package is built upon, and is a companion to, the gt package (“gt” for grammar of tables), which is supported by RStudio. The gtsummary package, like gt, is designed to produce nice html output with lots of nice formatting.
However, as a nice bonus, gtsummary includes a neat function as_flextable, which converts your resulting table into a flextable, which can be knit to a Microsoft Word Document or a Powerpoint presentation with Rmarkdown.
This means that you can make a table once, and be able to produce output in HTML for webpages, Microsoft Word for manuscripts, and MS Powerpoint for presentations from the same file without any conversion issues.
The only question is how and when you prefer to format your table. Both gt and flextable have great options for formatting your tables. You can do this in gt, then do as_flextable, or you can convert to a flextable first, then do your formatting. You can choose based on your comfort and familiarity with flextable vs. gt. Both have excellent explanatory websites, with flextable here and gtsummary here.
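As a minimal sketch of that workflow, the chunk below builds a descriptive table once and then converts it to a flextable. It assumes the gtsummary and flextable packages are installed, and it uses `trial`, an example dataset that ships with gtsummary, as a stand-in for your own study data; the choice of columns is arbitrary. In keeping with the CONSORT advice above, no p value column is added.

```r
library(gtsummary)

# `trial` is example data bundled with gtsummary (a stand-in for your data)
table_one <- tbl_summary(
  trial,
  by = trt,                        # one column per treatment arm
  include = c(age, grade, stage)   # baseline variables to describe
)

table_one   # renders as an HTML table when knit to a web page

# Convert for Word or PowerPoint output (needs the flextable package)
table_one_ft <- as_flex_table(table_one)
```

From here, `table_one_ft` can be formatted further with flextable functions before knitting to Word or PowerPoint.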
A common question in medical research is whether one group had a better outcome than another group. These outcomes can be measured as dichotomous outcomes like death or hospitalization, but continuous outcomes like systolic blood pressure, endoscopic score, or ejection fraction are more commonly available, provide more statistical power, and usually require a smaller sample size.
There is a tendency in clinical research to focus on dichotomous outcomes, even to the point of converting continuous measures to dichotomous ones (aka “dichotomania”, see Frank Harrell comments here), for fear of detecting and acting upon a small change in a continuous outcome that is not clinically meaningful.
While this can be a concern, especially in very large, over-powered studies, it can be addressed by aiming to detect a continuous difference that is at least as large as one that many clinicians agree (a priori) is clinically important (the MCID, or Minimal Clinically Important Difference).
The most common comparison of two groups with a continuous outcome is to look at the means or medians, and determine whether the available evidence suggests that these are equal (the null hypothesis). This can be done for means with Student’s t-test.
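To make the mechanics concrete, here is a sketch of what Student’s t-test computes. It uses the built-in mtcars data purely as a stand-in (miles per gallon by transmission type, chosen only because it ships with R). The hand calculation assumes equal variances in the two groups, which is the classic Student form; note that `t.test()` defaults to the Welch version, which does not make that assumption.

```r
# Stand-in data: mpg for automatic (am == 0) vs. manual (am == 1) cars
x <- mtcars$mpg[mtcars$am == 0]
y <- mtcars$mpg[mtcars$am == 1]

# Pooled standard deviation (equal-variance assumption)
n_x <- length(x)
n_y <- length(y)
sd_pooled <- sqrt(((n_x - 1) * var(x) + (n_y - 1) * var(y)) / (n_x + n_y - 2))

# t statistic: difference in means divided by its standard error
t_by_hand <- (mean(x) - mean(y)) / (sd_pooled * sqrt(1 / n_x + 1 / n_y))

# Two-sided p value from the t distribution with n_x + n_y - 2 df
p_by_hand <- 2 * pt(-abs(t_by_hand), df = n_x + n_y - 2)

# The built-in function gives the same answer when var.equal = TRUE
student <- t.test(x, y, var.equal = TRUE)
c(by_hand = t_by_hand, built_in = unname(student$statistic))
```

The two numbers printed at the end should agree, which shows that the test is nothing more than a difference in means scaled by its uncertainty.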
Let’s start by looking at the cytomegalovirus data set. This includes data on 64 patients who received a bone marrow stem cell transplant, and looks at their time to activation of CMV (cytomegalovirus). In the code chunk below, we group the data by a baseline variable and look at the mean time to CMV activation (the time.to.cmv variable). Donor CMV status (donor.cmv) is the natural starting point; the chunk as shown here groups by sex. Run the code (using the green arrow at the top right of the code chunk below) to see the difference in time to CMV activation in months between groups.
Try out some other grouping variables in the group_by statement.
Consider variables like donor.cmv, race, and recipient.cmv. Edit the code and run it again with the green arrow at the top right.
# insert libraries in each chunk as if independent
library(tidyverse)
library(medicaldata)
# grouped by sex here; swap in donor.cmv, race, or recipient.cmv
cytomegalovirus %>%
  group_by(sex) %>%
  summarize(mean_time2cmv = mean(time.to.cmv)) ->
  summ
summ
## # A tibble: 2 x 2
## sex mean_time2cmv
## <dbl> <dbl>
## 1 0 13.7
## 2 1 12.7
That seems like a noticeable difference between the two groups in the output above (which is grouped by sex): 13.7 months vs. 12.7 months. For donor.cmv, it would make theoretical sense for a CMV-positive donor to be associated with earlier activation of CMV in the recipient. But is a difference like this a significant one, one that would be very unlikely to happen by chance? That depends on things like the number of people in each group, and the standard deviation in each group. That is the kind of question you can answer with a t-test or, for particularly skewed data like hospital length of stay or medical charges, a Wilcoxon test.
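As a hedged sketch of how that question could be put to the data, the chunk below runs both tests on time to CMV activation by donor CMV status. It assumes the medicaldata package is installed, and it treats time.to.cmv as a simple continuous outcome, as this chapter does. No results are shown here; run it to see them.

```r
library(medicaldata)

# Welch two-sample t-test (R's default; does not assume equal variances)
welch <- t.test(time.to.cmv ~ donor.cmv, data = cytomegalovirus)

# Wilcoxon rank-sum test, an alternative for skewed outcomes
# (R may warn that an exact p value is unavailable when there are ties)
wilcox <- wilcox.test(time.to.cmv ~ donor.cmv, data = cytomegalovirus)

welch$p.value
wilcox$p.value
```

Comparing the two p values is a quick way to see how much the choice of test matters for a skewed outcome.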
library(tidyverse)
library(medicaldata)
data <- cytomegalovirus
head(data)
## ID age sex race diagnosis
## 1 1 61 1 0 acute myeloid leukemia
## 2 2 62 1 1 non-Hodgkin lymphoma
## 3 3 63 0 1 non-Hodgkin lymphoma
## 4 4 33 0 1 Hodgkin lymphoma
## 5 5 54 0 1 acute lymphoblastic leukemia
## 6 6 55 1 1 myelofibrosis
## diagnosis.type time.to.transplant prior.radiation
## 1 1 5.16 0
## 2 0 79.05 1
## 3 0 35.58 0
## 4 0 33.02 1
## 5 0 11.40 0
## 6 1 2.43 0
## prior.chemo prior.transplant recipient.cmv donor.cmv
## 1 2 0 1 0
## 2 3 0 0 0
## 3 4 0 1 1
## 4 4 0 1 0
## 5 5 0 1 1
## 6 0 0 1 1
## donor.sex TNC.dose CD34.dose CD3.dose CD8.dose TBI.dose
## 1 0 18.31 2.29 3.21 0.95 200
## 2 1 4.26 2.04 NA NA 200
## 3 0 8.09 6.97 2.19 0.59 200
## 4 1 21.02 6.09 4.87 2.32 200
## 5 0 14.70 2.36 6.55 2.40 400
## 6 1 4.29 6.91 2.53 0.86 200
## C1/C2 aKIRs cmv time.to.cmv agvhd time.to.agvhd cgvhd
## 1 0 1 1 3.91 1 3.55 0
## 2 1 5 0 65.12 0 65.12 0
## 3 0 3 0 3.75 0 3.75 0
## 4 0 2 0 48.49 1 28.55 1
## 5 0 6 0 4.37 1 2.79 0
## 6 0 2 1 4.53 1 3.88 0
## time.to.cgvhd
## 1 6.28
## 2 65.12
## 3 3.75
## 4 10.45
## 5 4.37
## 6 6.87
library(tidyverse)
library(medicaldata)
data %>%
  ggplot(mapping = aes(time.to.cmv)) +
  geom_density() +
  facet_wrap(~sex) +
  theme_linedraw()
library(tidyverse)
library(medicaldata)
data %>%
  ggplot(mapping = aes(time.to.cmv)) +
  geom_histogram() +
  facet_wrap(~race)
library(tidyverse)
library(medicaldata)
data$time.to.cmv %>%
shapiro.test()
##
## Shapiro-Wilk normality test
##
## data: .
## W = 0.68261, p-value = 0.0000000001762
library(tidyverse)
library(medicaldata)
df <- msleep
head(df$sleep_total)
## [1] 12.1 17.0 14.4 14.9 4.0 14.4
library(tidyverse)
library(medicaldata)
shapiro.test(df$sleep_total)
##
## Shapiro-Wilk normality test
##
## data: df$sleep_total
## W = 0.97973, p-value = 0.2143
library(tidyverse)
library(medicaldata)
t.test(df$sleep_total, alternative = "two.sided",
mu = 8)
##
## One Sample t-test
##
## data: df$sleep_total
## t = 4.9822, df = 82, p-value = 0.000003437
## alternative hypothesis: true mean is not equal to 8
## 95 percent confidence interval:
## 9.461972 11.405497
## sample estimates:
## mean of x
## 10.43373
Below is a flipbook slide show that illustrates how to do a t-test. When the title slide appears, click on it. You can then use the arrow keys (right/left or up/down) to step forward and backward through the slides, one line of code at a time, and watch the results appear.
Give it a try below. See if you can figure out what each line of code is doing.
This is t-testing in action.
library(tidyverse)
library(medicaldata)
library(janitor) # provides tabyl()
prostate <- medicaldata::blood_storage
tabyl(prostate$AA)
## prostate$AA n percent
## 0 261 0.8259494
## 1 55 0.1740506
library(tidyverse)
library(medicaldata)
df %>%
  filter(vore %in% c("herbi", "carni")) %>%
  t.test(formula = sleep_total ~ vore, data = .)
##
## Welch Two Sample t-test
##
## data: sleep_total by vore
## t = 0.63232, df = 39.31, p-value = 0.5308
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.911365 3.650509
## sample estimates:
## mean in group carni mean in group herbi
## 10.378947 9.509375
library(tidyverse)
library(medicaldata)
# x and y are passed as vectors, so no data argument is needed
t.test(x = df$sleep_total, y = df$awake)
##
## Welch Two Sample t-test
##
## data: df$sleep_total and df$awake
## t = -4.5353, df = 164, p-value = 0.00001106
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4.498066 -1.769404
## sample estimates:
## mean of x mean of y
## 10.43373 13.56747
library(tidyverse)
library(medicaldata)
library(broom)
df %>%
  filter(vore %in% c("carni", "insecti")) %>%
  t.test(formula = brainwt ~ vore, data = .) %>%
  tidy() ->
  result
result
## # A tibble: 1 x 10
## estimate estimate1 estimate2 statistic p.value parameter
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.0577 0.0793 0.0216 1.20 0.253 12
## # … with 4 more variables: conf.low <dbl>, conf.high <dbl>,
## # method <chr>, alternative <chr>
The command line is a simple Terminal window with a prompt at which you can type commands and do primitive but powerful things to your files. The UNIX computing environment was developed in the late 1960s, and is still beloved and fetishized by brogrammers, who believe you are not truly a programmer if you can’t code from the command line. This is silly.
The major attraction of UNIX in its early days was that it was much better than punch cards. Which isn’t saying much. We have had more than 50 years of software advancement and user interface improvements since then, so we really should not have to put up with the inherent user hostility of the UNIX environment.
UNIX is an early operating system, which is built around a ‘kernel’ which executes operating system commands, and a ‘shell’ which interprets your commands and sends them to the kernel for execution. The most common shell these days is named ‘bash’, short for “Bourne-again shell”, a brogrammer pun on the earlier Bourne shell (recent versions of macOS use a very similar shell, zsh, by default). You will sometimes see references to shell scripts or shell or bash programming. These are the same thing as command line programming.
UNIX is a common under-the-hood operating system across many computers today: Apple’s macOS and iOS are built on top of UNIX, and the various distributions of Linux are built on a UNIX-like kernel, with a similar command shell.
The command line is often the least common denominator between different pieces of open-source software that were not designed to work together. It can occasionally be helpful to build a data pipeline from mismatched parts.
However, there is a lot of low-quality user-hostile command line work involved to get it done, often referred to as “command-line bullshittery”. This is a common bottleneck that slows scientific productivity, and there is a vigorous discussion of it on the interwebs here and here (counterpoint). Essentially, some argue that it is largely a waste of time and effort, while others see it as a valuable learning experience, like doing least squares regression by hand with a pencil.
Running R from the command line is a bit like spending a day tuning your car’s engine by yourself. There is a case to be made that this will improve the efficiency and performance of your car, but it is also usually more efficient to pay someone else to do it, unless you are a car expert with a lot of free time.
You can run R from the command line. It has none of the bells and whistles, nor any of the user conveniences, of the RStudio Integrated Development Environment (IDE). But it is how R was originally expected to be used when it was developed in New Zealand in the 1990s (version 1.0.0 was released in 2000).
Running R from the command line allows you to do powerful things, like process multiple files at once, which can be handy when you have multiple files of sequencing data from distinct observations, or you have a multistep data wrangling pipeline with several slow steps. For many years, this was the only way to easily apply code across multiple files to build a complex data pipeline.
This is much less true today, with tools to handle file paths like the {here} and {fs} packages, run Python scripts from R with the {reticulate} package, run C++ code with {Rcpp}, and run bash, Python, SQL, D3, and Stan chunks from Rmarkdown. You can use the {drake} package (or its successor, {targets}) to manage multi-step data pipelines in different languages (similar to make). But some labs have been doing things at the command line for years, and find it hard to change.
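For instance, applying the same code across many files no longer needs the shell. A minimal base R sketch follows. The folder and the file names (sample_1.csv and so on) are made up for illustration, and they are created in a temporary directory so that the example is self-contained.

```r
# Create three small, hypothetical data files in a temporary folder
data_dir <- file.path(tempdir(), "sequencing_runs")
dir.create(data_dir, showWarnings = FALSE)
for (i in 1:3) {
  write.csv(
    data.frame(sample = i, reads = c(10, 20, 30) * i),
    file.path(data_dir, paste0("sample_", i, ".csv")),
    row.names = FALSE
  )
}

# Find every matching file, read each one, and stack the results
files <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)
all_runs <- do.call(rbind, lapply(files, read.csv))

# One summary row per file
aggregate(reads ~ sample, data = all_runs, FUN = mean)
```

With your own data, you would point `data_dir` at a real folder and replace the summary step with your own wrangling code.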
First, you need to open a terminal window. And to do that, you need to find it. This is akin to getting under the hood of a car, and computer makers don’t exactly encourage it.
So, you have managed to open a terminal window, which has a standard UNIX prompt, ending in something like % or $. Not terribly helpful, is it? The bash shell is waiting for you to enter a command.
No user interface for you!
Let’s start with a simple one, which can’t do any harm. Run the command below:
whoami
## peterhiggins
Remember that UNIX started out as an operating system for terminals, and knowing who was logged in was a helpful thing.
You can string together two commands with a semicolon between them.
Try the following:
whoami;date
## peterhiggins
## Thu Dec 31 18:17:55 EST 2020
OK, fine. This is sort of helpful. It was really important when you were on a terminal and paying by the minute for time on a mainframe back in 1969. And, on occasion, if you need to use an entire computer cluster to run a script (or scripts) on a lot of data, you will likely have to use some of this command line knowledge. You can even schedule jobs (scripts) to run when your time is scheduled on the cluster with cron and crontab.
At this point, it would be helpful to open a window with your Documents folder, and keep it side by side with the window in which you are reading this e-book. We will start working with files and directories, and it is helpful to see changes in your file/folder structure in real time. As we run commands in the bash shell, check them against what you see in the folder window. You may find that some files (dotfiles, starting with a period) are hidden from the user to prevent problems that occur when these are deleted.
OK, let’s start looking at files and directories. Start with the pwd command, which does not stand for password, but for print working directory.
Run the code below in your Terminal window.
pwd
## /Users/peterhiggins/Documents/RCode/rmrwr-book
You can see the full path to your current directory. This can be a bit obscure if you are just looking at your folder structure, particularly at the beginning of the path. Fortunately, the {here} package handles a lot of this for you when you are working in RStudio projects.
We think of the directory as a tree, with a root (written as a single /) and various branches, starting in this case with Users, as you build out folders and subfolders.
We can move up and down the folders of the directory paths with the cd command, for change directory.
Try this command in your Terminal Window, and see if you can figure out what it does.
cd ..
It changes the directory up one level closer to the root directory. It is straightforward to go up the directory tree, as each folder only has one parent. But it is tricky to go down the directory tree, as there are many possible branches/children, and you do not inherently know the names of these branches. We need to list the contents of your current directory with ls to know what is there.
Try the ls command in your Terminal window
cd /Users/peterhiggins/Documents/;
ls
## 1FQ_Crohn's Disease_23Oct2020 (002).doc
## 2020-Jun-05 AGA IMIBD meeting notest.docx
## 2021 AGA Invited Speaker Session Basic Hybrid Example.pdf
## 2021.Higgins AGA Distinguished Clinician.CO.docx
## A is for Allspice.2.0.docx
## A is for Allspice.docx
## ABT263_HIO_report_toWord.docx
## AGA IMIBD
## AGA IMIBD Councilor Career Discussion Guide.docx
## AGA IMIBD Webinar Outline.docx
## AIBD CAM Higgins.pdf
## AIBD CAM Higgins.pptx
## AIBD SoMe Higgins.pdf
## AIBD SoMe Higgins.pptx
## AIBD agreement.docx
## AIBD20Template.pptx
## AMAG DDW Clear draft_PDRH comments.docx
## APG1244_Milestone_report.docx
## ASUC_UC_protocol_comments_2020.docx
## A_Woodward_Score Sheet_PDRH.docx
## Accounts and Access (1) (1).docx
## Advice for participants in webinars.docx
## Animation of NSAID.pptx
## BKochar_Frailty.pdf
## BM recommendation.docx
## Beginners_GuideToR.pdf
## Biosketch for K.pptx
## Biosketch_2020_Higgins_ClinResIBD_biosketch.doc
## Brazil.ItineraryNov2015.docx
## Butter BCS Chicken.docx
## CAS.K.candidate.background_SB_PDRH.docx
## CAS.T32.Project.Description-JS.docx
## CAS.career.goals.obj.development.training_PDRH.docx
## CC360_The Risk of SARS.R1.docx
## CC360_The Risk of SARS.docx
## CCF IBD Webcast 2020 Draft Deck_For Review.pptx
## CCFA EIC Candidate Interview Questions (candidates) jobin[1].doc
## CDC_proposal1.1.docx
## CLARE STOCKS.docx
## COVID Trials Feasibility
## CaltechCampus Tour & Information Session.webarchive
## Cancel Appt Epic.ppt
## Causal.png
## CellDeath_DDW_2021_ISS.pdf
## Chu RPG Review_PDRH.docx
## Clare Investment Summary.docx
## Council Conversations Author Chat Guide.docx
## Coursera_Programming in R Notes.docx
## CoverLetterPlus.pptx
## Crash&Burn_ScriptV2_100318 copy.pdf
## DataCamp Courses by Topic.docx
## DeEscalationACG2016.pptx
## Demographics.pdf
## Documents.Rproj
## DrHiggins IBD Data Request.xlsx
## Draft Postop IBD Surgery Care Protocols v2_SERedit.docx
## ECCO 2016 Amsterdam Schedule.docx
## ECCO 2019 UC PRO SS Abstract D1f_JP_UA_YO_AM_PDRH.docx
## ECCO2016Lycera30937.pptx
## Effect of medications on the recurrence of cancer in IBD patients.docx
## Electrical engineering interview questions.docx
## FDAtofaResponse.docx
## FFMI Kickstart-FinalReport 5-20-16-LJ.docx
## FITBITProtocol_28NOV2016_AbbVie.docx
## FITBITProtocol_4DEC2016_AbbVie.docx
## FMT_DDW_2021_ISS.pdf
## FibrosisIBDCedars2016.pptx
## Figures-KC-JAMA.pptx
## Finance and Retirement Plans.docx
## Financial Priorities.docx
## Garmin Notes.docx
## General Social Media Tips.docx
## General thoughts about query letters.docx
## Getting Started with REDCap.docx
## Git for MDs_2.pptx
## GitHub
## Github for MDs_1.pptx
## Glover_RPG_Review_PDRH.docx
## GoToMeeting Chats
## GradPartyHigginsInvites.xlsx
## HPI-5016 IBD Patient Contact Info.xlsx
## HS movie.docx
## Higgins AGA Webinar Slides.pptx
## Higgins Bio.docx
## Higgins New IBD.pptx
## Higgins Refractory Proctitis.pptx
## Higgins biosketch2015KRao.doc
## Higgins biosketch2016KRao.doc
## Higgins-peter.jpg
## HigginsACGMidwest2019_PerioperativeIBD.pptx
## Higgins_LOS_IBDBiobank_Shah_Nusrat_2019.docx
## Higgins_UM_CME_Pregnancy in IBD.pptx
## How To Log in to IBD Server.docx
## How To Log in to RStudio Server for HigginsLab.docx
## How To Log in to RStudio Server for Shiny.docx
## IBD 2020 - Honorarium reimbursement Form.docx
## IBD Biobank Cryostor.pptx
## IBD Clinical Trials for MDsDearborn2017.pptx
## IBD Insurance Pilot Results.docx
## IBD Insurance Survey for CCFA Partners Existing.docx
## IBD Journal Club 13Feb2017.docx
## IBD Journal Club July 11.docx
## IBD Plexus meeting 21 Sep 2015 notes.docx
## IBD School 322 Script.docx
## IBD School 324 Script.docx
## IBD School 325 Script.docx
## IBD and biologics tweets.docx
## IBD inbox coverage.docx
## IBDInsuranceSurvey3.docx
## IBDMentoringConferenceCall4AbstractsPH.docx
## IBD_Deescalation_Apr_2019_PDRH.docx
## IBDforLansing2017.pptx
## IMG_0006.jpg
## IMG_0008.jpg
## IMG_1523st.jpg
## IMIBD Councilors 2020-21.docx
## IMIBD Partners insurance 2020DDW.pptx
## IMIBD_expanded_descriptors.xlsx
## Introduction to Application Supplement Photoacoustic.docx
## JAK_DDW_2021_ISS.pdf
## JAMA_KC_Second JAMA.docx
## JAMA_Review_on_CD_Revisions_Tracked_Changes with edits_PDRH.docx
## JB_V1 Career Goals and Objectives 7.8.2020_PDRH.docx
## JB_V2 Candidate’s Background 7.7.2020_PDRH.docx
## JDix_Study_update.docx
## K Award Institutional Letter of Commitment.pptx
## K Candidate Section.pptx
## K105_Melmed_PROs in Practice_MB_bb_JLS.pptx
## K23 Aims - Shirley Cohen-Mekelburg 11.14.19.docx
## K23_morph_measurements_MockupManuscript_21JAN2019.docx
## Learning R discussion Jeremy Louissaint.docx
## Letter to Frank Hamilton.docx
## Lin_Reviewer Score_PDRH.docx
## Log in to IBD Server.docx
## MEI_2020_PH_W9.pdf
## MEI_ACH_Wire Transfer Form.docx
## MIM-TESRIC PROTOCOL_Higgins_14Apr2020.docx
## MIM-TESRIC PROTOCOL_Higgins_26Aug2020.docx
## Managment of CD.pptx
## Manuscript v1.docx
## Manuscript v2.PDRH.docx
## McDonald, Nancy.pdf
## Megan McLeod Rec Letter Residency.docx
## MentoringAgendaDraftPH.docx
## Meta analysis TB vs CD version 3.5.docx
## Michigan Medicine Gastroenterology Social Media Initiative.docx
## Michigan Medicine Model for COVID-19 Clinical Trial Oversight DRAFT (KSB 04.17.20)-AL-PDRH.docx
## Microsoft User Data
## MultidisciplinaryIBDClinicPHv2.docx
## NordicTrackTC9iTreadmillManual.pdf
## Oct2019payPDRH.PDF
## Odd college lists.docx
## P Singh K grant aims 8-25_PDRH.docx
## P2PEP slide 2020
## P2PEP slide 2020.pptx
## PHcv2019.docx
## PHcv2020.docx
## PRO agenda videos VINDICO.docx
## PRO letter.docx
## PS_K grant aims 6-25_PDRH.docx
## PTM LOS From PDRH.docx
## PTM LOS From PDRH.pdf
## Pearson 5 Notes.docx
## Perils of Excel.pptx
## Personal statement version 3!.docx
## Pitch Letter - S is for Saffron.docx
## Poppy Eulogy backup.docx
## Poppy Eulogy.docx
## Possible Eastern College Tour.docx
## Powerpoint
## Prashant Rec Letter.docx
## Prashant Rec Letter.pdf
## PredictingIBD_DDW_2021_ISS.html
## PredictingIBD_DDW_2021_ISS.pdf
## Purdue Disclosure Form_Higgins.docx
## Question 16.docx
## RCode
## Ramp up clinical research_PH.xlsx
## Ramping up human subject research - MM 6-1-20 _KDA_PDRH_suggestions.docx
## Recordings
## Review Criteria for COVID Clinical Trials.docx
## Review guidelines_2017.docx
## Roasted Salted Cashews.docx
## S is for Saffron 3.0.docx
## S is for Saffron 3.1.docx
## S is for Saffron 3.2.docx
## S is for Saffron.2.0.docx
## SEAN STOCKS.docx
## SIG_Template_IBD Program_FINAL.docx
## Sean Common App academic honors list.docx
## Sean Common App activities list.docx
## Sean Higgins Bordogni.mp4
## Sean Higgins Brag Sheet.docx
## Sean Investment Summary.docx
## Sean Resume Tabular VBorder.docx
## Sean Resume Tabular.docx
## Sean Resume.docx
## Sean Summer Priorities 2016.docx
## SecureIBD.pptx
## ShareRmd.html
## Sherman Prize Nominee Questions.docx
## Shoreline West Tour Information.docx
## Short PA slides.pptx
## Shotwave thread.docx
## Signing Clinical Research Infusion Orders.pdf
## SingleCell_DDW_2021_ISS.pdf
## SoMe_use_2020.png
## Social Media for GI.pptx
## Source Code PT1.docx
## Stelara paper.docx
## T32_current_text_14June2019.docx
## TOPPIC ML draft v5SCM_YL_AKW_PDRH.docx
## TabaCrohn IBD J club.docx
## Tables.docx
## Takeda_IBD School Videos_Submission.pdf
## Task List 2020-2.docx
## Task List 2020-5.docx
## Task List 2020.docx
## Testing signatures with Adobe.pdf
## The Risk of SARS.R1.Markup.docx
## Tidymodels.docx
## Tofa in ICI Figure Legends_Final Draft_V2.docx
## Tofa inpatient induction Protocol_02NOV2018_PHforEdits.docx
## Toffee Separation Tips.docx
## UCRx_DDW_2021_ISS.pdf
## UC_protocol_comments_2020.docx
## UM IBD Clinical Trials IBD referral form.docx
## UPA_U_ACHIEVE 1st draft_PDRH.docx
## VINDICO_PRO.pptx
## VideoVisitSchedulingQuickApptsforProviders.pdf
## VincentChen_K specific aims 2020-10-25.docx
## VirtualPtEdMar2020.v2.pdf
## WebEx
## Zoom
## Zwift
## Zwift-Gift-Card.pdf
## aga institute council july 2020 meeting.pdf
## algorithms_thiopurine.pdf
## base-r-cheatsheet.pdf
## biomakers_fibrosisPDRH.docx
## bmj_imputation.pdf
## cgh_factors_utilization.pdf
## cycling core exercises.docx
## draft_tokenization letter Risa_Uste.docx
## early-career-faculty_Dec-2020.xlsx
## epic cancel_reschedule appointments.ppt
## epic schedule viewing_close.ppt
## escalator.html
## fellow graduation 2020.docx
## hexStickers.jpg
## higgins2x3.jpg
## iBike Rides
## learnr app diagram.jpg
## learnr app diagram.pptx
## letter Lowrimore.docx
## mockstudy manuscript draft.docx
## nejm1966_beecher_ethics.pdf
## nejm_indomethacin.pdf
## nejm_statins.pdf
## pdrh_IBD_email.xlsx
## personal statement fellowship_PDRH.docx
## peterhiggins.jpg
## seq-6.pdf
## signature.docx
## signature.fld
## signature.html
## signature.pdf
## signature.png
## stiff_bcl.R
## submitJanssen_IBD School Videos_12Jul2018.pdf
## tidyr_pivot.png
## tidyr_pivot.xcf
## ucla1.jpg
## untidy_sheets.pptx
## wga_min20.pdf
## ~$T Review Higgins.docx
## ~$sk List 2020-5.docx
## ~$sk List 2020.docx
You will see a listing of all files and folders in the current directory. You can get more details by adding the option (sometimes called a flag) -l to the ls command.
cd /Users/peterhiggins/Documents/;
ls -l
The full listing will give you more details, including read & write permissions, file size, date last saved, etc.
Many commands have options, or flags, that modify what they do.
Find a folder inside of your Documents folder. We will now go down a level in the directory tree. In my case, I will use the Powerpoint folder.
In your Terminal window:
cd /Users/peterhiggins/Documents/Powerpoint;
ls
## 2016IBDClinTrialsforMDsDearborn.pptx
## 2016IntegratedDeckorMDsGB.pptx
## 2019 SCSG GI Symposium IBD SoA - Read-Only.pptx
## BE LGD Dearborn 2016.04.12.pptx
## Getting Started in RStudio.pptx
## Higgins Microbiota for IBD Patient Ed.pptx
## HigginsDec2018AJG_SmokingStatus.pptx
## IBDUpdate.pptx
## Integrated Slide Deck Dearborn 2016.04.12.pptx
## MER Stress Management Dearborn 4-14.pptx
## MichiganMedicine-IBDTemplate.potx
## PDRH RCAR 2020.pptx
## PennThioMTX2017Higgins.pptx
## Pregnancy in IBD.pptx
## Regenbogen CRS for GI CME Course2016.pptx
## Senior Slide Show.pptx
## ThomsonRectalStumpComplicationsIBD2_13.pptx
## UEGweek2020.pptx
## UMHS Talk- Moving Beyond AntiTNF 4-2016 FINAL v2.pptx
## Vertebrate Animals for K.pptx
## VirtualPtEdMar2020.v2.pptx
## Writers Room.pptx
## ibd_meds_surgery_metan.pptx
Great!
You moved to a new directory and listed it.
Now we will get fancy, and make a new directory within this directory with the mkdir command.
Try this in your Terminal window:
pwd;
mkdir new_files;
ls
You have now made a new directory (folder) within the previous directory, named new_files. Verify this in your Documents folder.
You can now change to this directory and list the contents (it should be empty).
Try this out in your Terminal window (note: edit the cd command to use your own directory path).
cd /Users/peterhiggins/Documents/Powerpoint/new_files;
ls
Note that you can abbreviate the current directory with ., so that you could have also used cd ./new_files
You can create a new (empty) file in this directory with the touch command.
Sometimes you need to create a new file, then write data to it.
Try this out
touch file_name;
ls
You can also create a file with data inside it with the cat > command (cat stands for concatenate).
Type the following lines into your Terminal window. When you are done, press Control-D on a new line to finish and return to the Terminal prompt.
cat > file2.txt
cat1
cat2
cat3
Now you can list the contents of this file with the cat command below.
Give this a try
cat file2.txt
You can also list the directory of your new_files folder with ls to see the new folder contents.
Try this
ls
Note that you don’t need to use the Terminal to run bash commands. You can do this from an Rmarkdown file.
Take a moment to run pwd in your Terminal, to get the current directory.
Now open Rstudio, and a new Rmarkdown document.
Copy the path to the current directory from the Terminal.
Switch back to the Rmarkdown document.
Select one of the R code chunks (note the {r} at the top) and delete it.
Now click on the Insert dropdown at the top of the document, and insert a Bash chunk.
Now add UNIX commands (separated by a semicolon), like
cd (paste in path here);
pwd;
ls;
cat file2.txt
Then run this chunk.
Now you can run terminal commands directly from Rmarkdown!
OK, now we are done with the file file2.txt and the directory new_files.
Let’s get rid of them with rm (for removing files) and rmdir for removing directories.
In order, we will
- Make sure we are in the right directory
- remove the file with rm file2.txt
- go up one level of the directory with cd ..
- remove the directory with rmdir new_files
Give this a try
pwd;
rm file2.txt;
cd ..;
rmdir new_files
Verify all of this in your Documents window.
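For comparison, the same make, write, read, and remove cycle can be done without leaving R. This is a sketch using base R functions in a temporary directory (so nothing in your Documents folder is touched); the file and folder names simply mirror the shell example above.

```r
# Work inside a throwaway directory rather than Documents
new_dir <- file.path(tempdir(), "new_files")
dir.create(new_dir, showWarnings = FALSE)       # like mkdir new_files

file2 <- file.path(new_dir, "file2.txt")
writeLines(c("cat1", "cat2", "cat3"), file2)    # like cat > file2.txt

list.files(new_dir)                             # like ls
contents <- readLines(file2)                    # like cat file2.txt
contents

removed_file <- file.remove(file2)              # like rm file2.txt
unlink(new_dir, recursive = TRUE)               # like rmdir new_files
dir.exists(new_dir)                             # FALSE once it is gone
```

Note that file.remove() and unlink() are just as unforgiving as rm: the files do not go to the trash folder.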
This is great. But you can imagine a situation in which you mistakenly rm a file (or directory) that you actually needed. Unlike your usual user interface, when a file is removed at the command line, it is gone. It is not in the trash folder. It is gone. There is something to be said for modern user interfaces, which are built for humans, who occasionally make mistakes. Sometimes we do want files or folders back.
Here are some file commands worth knowing
- cat filename - print out the whole file to your monitor
- less filename - print out the first page of a file, and scroll through each page one at a time
- head filename - print the first 10 lines of a file
- tail filename - print the last 10 lines of a file
- cp file1 file2 - copy file1 to file2
- mv file1.txt file2.txt file3.txt new_folder - move 3 files to a new folder
So now you can get around directories, and find your files in the Terminal window, but you really want to run R.
You can launch an R session from the Terminal Window (if you have R installed on your computer) by typing the letter R at the Terminal prompt
Launch R
R
You get the usual R intro, including version number, and the R prompt (>).
Now you can run R in interactive mode with available datasets, or your own datasets.
Try a few simple commands with the mtcars dataset.
Give the examples below a try.
You can use q() to quit back to the terminal (and reply “n” to not save the workspace image).
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0
## gear carb
## Mazda RX4 4 4
## Mazda RX4 Wag 4 4
## Datsun 710 4 1
## Hornet 4 Drive 3 1
## Hornet Sportabout 3 2
## Valiant 3 1
library(tidyverse) # for %>%, filter(), and select()
mtcars %>%
  filter(mpg > 25) %>%
  select(starts_with('m') | starts_with('c'))
## mpg cyl carb
## Fiat 128 32.4 4 1
## Honda Civic 30.4 4 2
## Toyota Corolla 33.9 4 1
## Fiat X1-9 27.3 4 1
## Porsche 914-2 26.0 4 2
## Lotus Europa 30.4 4 2
Sometimes you will want to call R, run some code, and be done with R.
You can call R, run a few lines, and quit in one go.
Just add the flag -e (for evaluate) to the call to R,
and put the R commands in quotes.
Try the example below
(note that this will not work if you are still in R - be sure you are back in the terminal with the % or $ prompt)
R -e "head(mtcars)"
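or this example. Either single or double quotes will work around the R code, as long as they match; any quotes inside the R code should then be of the other type. Note that install.packages() needs a CRAN mirror (the repos argument) when it is not run interactively.
Try this
R -e 'install.packages("palmerpenguins", repos = "https://cloud.r-project.org")'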
You can also string together several commands with the semicolon between them.
Try the example below.
R -e 'library(palmerpenguins);data(penguins);tail(penguins)'
Now we are stepping up a level - you have an R script that you have carefully created and saved as the myscript.R file. How do you run this from the Terminal?
This is easy - just call the Rscript command with your file name.
Pick out a short R file you have written, make sure you are in the right directory where the file is, and use it as in the example below.
Rscript myscript.R
This launches R, runs your script, saves resulting output (if your script includes save or ggsave commands), closes R, and sends you back to the Terminal. Very simple.
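As an illustration, here is a sketch of what a small myscript.R might contain. Everything in it is hypothetical: it reads an optional column name from the command line with `commandArgs()`, falls back to mpg when none is given, and writes a one-row summary of the built-in mtcars data to a CSV file in the working directory.

```r
# myscript.R (hypothetical example)
# Usage from the Terminal:  Rscript myscript.R hp

# Arguments typed after the script name; empty if none were given
args <- commandArgs(trailingOnly = TRUE)

# Use the first argument only if it names a column in mtcars
column <- if (length(args) >= 1 && args[1] %in% names(mtcars)) args[1] else "mpg"

# Summarize the chosen column and save the result
result <- data.frame(
  column = column,
  mean   = mean(mtcars[[column]]),
  sd     = sd(mtcars[[column]])
)
out_file <- "mtcars_summary.csv"   # written to the current working directory
write.csv(result, out_file, row.names = FALSE)

# Anything printed appears in the Terminal
print(result)
```

Passing arguments this way lets one script be reused across columns (or files) without editing it each time.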
Rendering an Rmarkdown file is a little different, as you can’t just run an Rmarkdown file. Normally you would use the Knit dropdown button to knit your file from RStudio. But you can use the rmarkdown::render command to render your files to HTML, PDF, Word, PowerPoint, etc. Pick out a simple Rmd file like output_file.Rmd below, make sure you are in the directory where the file is, and try something like the example below.
Note that this is one case where nesting different types of quotes (single vs. double) comes in handy.
It helps to use single quotes around your filename and double quotes around the rmarkdown::render command.
Try it out
Rscript -e "rmarkdown::render('output_file.Rmd')"
So there you have it!
Just enough to get you started with R from the command line.
This book is published on bookdown.org, where you can create an account to publish your own e-book and share it with the world.
Once you have an account, you can set up a book of your own:
- Install the {bookdown} package, with install.packages('bookdown').
- Run library(bookdown) in the Console to load the package.
- In the RStudio IDE, choose File/New Project/Book Project using bookdown.
- Go to the Files tab, open index.Rmd, and click the Knit button. The Preview Window will show you a minimal example of a bookdown book. You can start editing and adding chapters.
You can edit your _bookdown.yml file, which controls the setup of your book.
My _bookdown.yml file looks like this:
book_filename: "rmrwr"
title: "Reproducible Medical Research with R"
language:
  ui:
    chapter_name: "Chapter "
delete_merged_file: true
new_session: yes
rmd_files:
  - index.Rmd
  - io02-getting-started.Rmd
  - io03-tasting.Rmd
  - io65-error_messages.Rmd
  - io04-updating.Rmd
  - io07-major-updates.Rmd
  - io08-data-validation.Rmd
  - io09-timeseries.Rmd
  - io10-tableOne.Rmd
  - io30-ttest.Rmd
  - io70-r_cmd_line.Rmd
  - io98-title-holder.Rmd
  - io99-references.Rmd
You can edit your _output.yml file, which controls the output and look of your book.
My _output.yml file looks like this:
bookdown::gitbook:
  css: style.css
  config:
    toc:
      before: |
        <li><a href="./">RMRWR</a></li>
      after: |
        <li><a href="https://github.com/rstudio/bookdown" target="blank">Published with bookdown</a></li>
    edit: https://github.com/rstudio/bookdown-demo/edit/master/%s
    download: ["pdf", "epub"]
bookdown::pdf_book:
  includes:
    in_header: preamble.tex
  latex_engine: xelatex
  citation_package: natbib
  keep_tex: yes
bookdown::epub_book: default
Note that this refers to a style.css file, which affects the appearance of your book.
My style.css file looks like this:
@import url('https://fonts.googleapis.com/css?family=Abril+Fatface|Source+Sans+Pro:400,400i,700,700i|Lora:400,400i,700,700i&display=swap');
p.caption {
color: #777;
margin-top: 10px;
}
p code {
white-space: inherit;
}
pre {
word-break: normal;
word-wrap: normal;
}
pre code {
white-space: inherit;
}
/* Desiree custom css */
/* next 3 rules for setting large image at top of each page and pushing book content to appear beneath that */
/*
.hero-image-container {
position: absolute;
top: 0;
left: 0;
right: 0;
height: 390px;
/*background-image: url("images/books.jpg");
background-color: #2F65A7;
}*/
/*.hero-image {
width: 100%;
height: 390px;
object-fit: cover;
}*/
/*.page-inner {
padding-top: 440px !important;
}*/
/* Links */
.book .book-body .page-wrapper .page-inner section.normal a {
color: #702082;
}
/* Body and header text */
.book.font-family-1 {
font-family: 'Source Sans Pro', arial, sans-serif;
}
h1, h2, h3, h4 {
font-family: 'Lora', arial, sans-serif;
}
.book .book-body .page-wrapper .page-inner section.normal h1,
.book .book-body .page-wrapper .page-inner section.normal h2,
.book .book-body .page-wrapper .page-inner section.normal h3,
.book .book-body .page-wrapper .page-inner section.normal h4,
.book .book-body .page-wrapper .page-inner section.normal h5,
.book .book-body .page-wrapper .page-inner section.normal h6 {
margin-top: 1em;
margin-bottom: 1em;
}
.title {
font-family: 'Lora';
font-size: 3em !important;
color: #2f65a7;
margin-top: 0.275em !important;
margin-bottom: 0.35em !important;
}
.subtitle {
font-family: 'Lora';
color: #2f65a7;
}
/* DROP CAPS*/
/*p:nth-child(2):first-letter { /* /* DROP-CAP FOR FIRST P BENEATH EACH H1 OR H2*/ /*
color: #2f65a7;
float: left;
font-family: 'Abril Fatface', serif;
font-size: 7em;
line-height: 65px;
padding-top: 4px;
padding-right: 8px;
padding-left: 3px;
margin-bottom: 9px;
}
*/
/* try the below with the ~ instead...or just the space?) */
.section.level1 > p:first-of-type:first-letter { /*drop cap for first p beneath level 1 headers only within class .section*/
color: #2f65a7;
float: left;
font-family: 'Abril Fatface', serif;
font-size: 6em;
line-height: 65px;
padding-top: 4px;
padding-right: 8px;
padding-left: 3px;
margin-bottom: 9px;
}
/* add drop cap to first paragraph that follows the first 2nd level header*/
/*
.section.level2:first-of-type > p:first-of-type:first-letter {
color: #2f65a7;
float: left;
font-family: 'Abril Fatface', serif;
font-size: 7em;
line-height: 65px;
padding-top: 4px;
padding-right: 8px;
padding-left: 3px;
margin-bottom: 9px;
}
*/
/* TOC */
.book .book-summary {
background: white;
border-right: none;
}
.summary{
font-family: 'Source Sans Pro', sans-serif;
}
/* all TOC list items, basically */
.book .book-summary ul.summary li a, .book .book-summary ul.summary li span {
padding-top: 8px;
padding-bottom: 8px;
padding-left: 15px;
padding-right: 15px;
color: #00274c;
}
.summary a:hover {
color: #ffcb05 !important;
}
.book .book-summary ul.summary li.active>a { /*active TOC links*/
color: #d86018 !important;
border-left: solid 4px;
border-color: #d86018;
padding-left: 11px !important;
}
li.appendix span, li.part span { /* for TOC part names */
margin-top: 1em;
color: #000000;
opacity: .9 !important;
text-transform: uppercase;
}
.part + li[data-level=""] { /* grabs first .chapter immediately after .part...but only those ch without numbers */
text-transform: uppercase;
}
ul.summary > li > a { /* The > selects all the li's which are immediately within the class summary*/
font-family: 'Source Sans Pro', sans-serif;
}
/* The next two rules make the horizontal line go straight across in top navbar */
.summary > li:first-child {
height: 50px;
padding-top: 10px;
border-bottom: 1px solid rgba(0,0,0,.07);
}
.book .book-summary ul.summary li.divider {
height: 0px;
}
/* source code copy button */
.copy {
width: inherit;
background-color: #e2e2e2 ;
border: none;
border-radius: 2px;
float: right;
font-size: 60%;
padding: 4px 4px 4px 4px;
}
/* Two columns */
.col2 {
columns: 2 200px; /* number of columns and width in pixels*/
-webkit-columns: 2 200px; /* chrome, safari */
-moz-columns: 2 200px; /* firefox */
}
.side-by-side {
display: flex;
}
.side1 {
width: 40%;
}
.side2 {
width: 58%;
margin-left: 1rem;
}
/* -------------- div tips-------------------*/
div.warning, div.tip, div.tryit, div.challenge, div.explore {
border: 4px #dfedff; /* very light blue */
border-style: solid;
padding: 1em;
margin: 1em 0;
padding-left: 100px;
background-size: 70px;
background-repeat: no-repeat;
background-position: 15px center;
min-height: 120px;
color: #00274c; /* blue text */
background-color: #bed3ec; /* light blue background */
}
div.warning {
background-image: url("images/warning.png");
background-color: #f7f7f7; /* gray97 background */
}
div.tip {
background-image: url("images/tip.png");
background-color: #fff7bc; /* warm yellow background */
}
div.tryit {
background-image: url("images/tryit.png");
background-color: #edf8fb; /* light blue background */
}
div.challenge {
background-image: url("images/challenge.png");
color: #4b0082; /* indigo text */
background-color: #ffe1ff; /* thistle background */
}
div.explore {
background-image: url("images/explore.png");
background-color: #d0faee; /* green card background */
}
/* .book .book-body .page-wrapper .page-inner section.normal is needed
to override the styles produced by gitbook, which are ridiculously
overspecified. Goal of the selectors is to ensure internal "margins"
controlled only by padding of container */
.book .book-body .page-wrapper .page-inner section.normal div.rstudio-tip > :first-child,
.book .book-body .page-wrapper .page-inner section.normal div.tip > :first-child {
margin-top: 0;
}
.book .book-body .page-wrapper .page-inner section.normal div.rstudio-tip > :last-child,
.book .book-body .page-wrapper .page-inner section.normal div.tip > :last-child {
margin-bottom: 0;
}
iframe {
-moz-transform-origin: top left;
-webkit-transform-origin: top left;
-o-transform-origin: top left;
-ms-transform-origin: top left;
transform-origin: top left;
}
.iframe-container {
overflow: auto;
-webkit-overflow-scrolling: touch;
border: #ddd 2px solid;
box-shadow: #888 0px 5px 8px;
margin-bottom: 1em;
}
.iframe-container > iframe {
border: none;
}
Each chapter was created in R Markdown, with R code chunks, flipbooks, and learnr apps as exercises.
Note that each chapter should start with a level 1 header, which will be the title of the chapter. Each level 1 header starts with a single hash symbol (#), then a space, then the text of the title.
You can save draft chapters without necessarily publishing them to the final book. They will not be included until you list them in your _bookdown.yml file.
After saving and knitting each chapter successfully, the finalized chapters can be included in the book build, and ordered, by adding them to the _bookdown.yml file, in between index.Rmd, and io98-title-holder.Rmd.
The chapter file names follow the convention io##-Topic.Rmd. This is so that they sort alphabetically after index.Rmd and largely stay in order.
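You can check the sorting behavior in the Terminal with a quick sketch like the one below. It feeds four file names (taken from the _bookdown.yml listing, deliberately out of order) to the sort command. Setting LC_ALL=C is an assumption added here to force plain byte-order sorting so that the result is the same on any machine.

```shell
# Sort a few chapter file names the way a plain alphabetical listing would
sorted=$(printf '%s\n' io10-tableOne.Rmd io99-references.Rmd index.Rmd io02-getting-started.Rmd | LC_ALL=C sort)
printf '%s\n' "$sorted"
```

index.Rmd comes out first because "in" sorts before "io", and the two-digit numbers keep the remaining chapters in order.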
Add the new chapter to the list of chapters in order in _bookdown.yml, somewhere in between
- index.Rmd and
- io98-title-holder.Rmd
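A mistyped file name in _bookdown.yml is an easy mistake to make when adding chapters. As a rough sketch, the hypothetical helper below (check_chapters is a made-up name, not part of bookdown) reads the list entries ending in .Rmd from a _bookdown.yml file and reports any that do not exist in the same folder. It assumes one file per line and no spaces in file names. The demo folder and its files are created only to show the helper at work.

```shell
# Report any chapter listed in _bookdown.yml that is missing from its folder
check_chapters() {
  yml="${1:-_bookdown.yml}"
  missing=0
  # pull out list entries such as "- io02-getting-started.Rmd"
  for f in $(sed -n 's/^[[:space:]]*-[[:space:]]*\(.*\.Rmd\)[[:space:]]*$/\1/p' "$yml"); do
    if [ ! -f "$(dirname "$yml")/$f" ]; then
      echo "Missing chapter file: $f"
      missing=$((missing + 1))
    fi
  done
  echo "$missing missing file(s)"
}

# Demo in a scratch folder: three chapters listed, only two files present
demo=$(mktemp -d)
printf 'rmd_files:\n- index.Rmd\n- io02-getting-started.Rmd\n- io98-title-holder.Rmd\n' > "$demo/_bookdown.yml"
touch "$demo/index.Rmd" "$demo/io98-title-holder.Rmd"
check_chapters "$demo/_bookdown.yml"
```

In your own book folder you would simply run check_chapters with no argument, which looks for _bookdown.yml in the current directory.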
Render the book with bookdown::render_book('index.Rmd')
Publish the book with
bookdown::publish_book(account = 'pdr_higgins')
Then commit the changes and push to GitHub.
Within a minute or three, the updated book will appear at:
https://bookdown.org/pdr_higgins/rmrwr/
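If you prefer the Terminal, the render, publish, and commit steps can be chained together. The function below is a hypothetical convenience wrapper (update_book is a made-up name), not part of bookdown. It assumes that you run it from the book's project folder, that your bookdown.org account was already connected in an earlier interactive R session, and that staging every changed file with git add -A is what you want.

```shell
# Hypothetical helper: rebuild the book, publish it, then record the changes in git.
# Each step runs only if the one before it succeeded.
update_book() {
  msg="${1:-Update book}"   # commit message, with a default
  Rscript -e "bookdown::render_book('index.Rmd')" &&
    Rscript -e "bookdown::publish_book(account = 'pdr_higgins')" &&
    git add -A &&
    git commit -m "$msg" &&
    git push
}
```

After defining the function, a call such as update_book "Add t test chapter" would run all of the steps in order. Replace the account name with your own.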
More details can be found at:
https://bookdown.org/yihui/bookdown/rstudio-connect.html
and at